#129 Latest endpoint: group by additional fields
Closed 6 years ago by jskladan. Opened 6 years ago by gnaponie.
taskotron/ gnaponie/resultsdb scenario-filter  into  develop

file modified
+10 -5
@@ -179,7 +179,7 @@ 

          }

  

  

- ## Get a list of current Results for a specified filter _FIXME: reword to make more sense_ [GET /results/latest{?keyval,testcase,groups,since}]

+ ## Get a list of current Results for a specified filter _FIXME: reword to make more sense_ [GET /results/latest{?keyval,testcases,groups,since}]

  

  Especially with automation in mind, a simpe query to get the latest `Results` of all the `Testcases` based on a filter

  makes a lot of sense. For example Koji could be interested in data like "All current results for the `koji_build` `koschei-1.7.2-1.fc24`", without
@@ -189,21 +189,26 @@ 

  Only `Testcases` with at least one `Result` that meet the filter are present - e.g. if ResultsDB contained `dist.rpmlint` and `dist.rpmgrill`

  `Testcases`, but there was only a `dist.rpmlint` `Result` for the `koschei-1.7.2-1.fc24` `koji_build`, just `dist.rpmlint`'s `Result` would be returned.

  

+ An additional available parameter is `_distinct_on`, if specified allows the user to group by additional fields (example: `scenario`).

+ 

  + Parameters

      + keyval (string)

          - Any key-value pair in `Result.data`. Replace `keyval` with the key's name: `...&item=koschei-1.7.2-1.fc24`

          - Multiple values can be provided, separate by commas to get `or` filter based on all the values provided: `...&arch=x86_64,noarch`

          - `like` filter with `*` as wildcards: `...&item:like=koschei*fc24*`

          - Multiple key-value pairs provide `and` filter, e.g. to search for all `Results` with `item` like `koschei*fc24*` and `arch` being either `noarch` or `x86_64`: `...&item:like=koschei*fc24*&arch=noarch`

-     + testcase (string, optional)

+     + testcases (string, optional)

          - Use to narrow down `Testcases` of interest. By default, all `Testcases` are searched for `Results`

-         - Multiple values can be provided, separate by coma to get `or` filter based on all the values provided: `...&testcase=dist.rpmlint,dist.depcheck`

-         - `like` filter with `*` as wildcards: `...&testcase:like=dist.*`

+         - Multiple values can be provided, separate by comma to get `or` filter based on all the values provided: `...&testcases=dist.rpmlint,dist.depcheck`

+         - `like` filter with `*` as wildcards: `...&testcases:like=dist.*`

      + groups: `27f94e36-62ec-11e6-83fd-525400d7d6a4` (string, optional)

-         - Multiple values can be provided, separate by commas to get `or` filter based on all the values provided: `...&group=uuid1,uuid2`

+         - Multiple values can be provided, separate by commas to get `or` filter based on all the values provided: `...&groups=uuid1,uuid2`

      + since: `2016-08-15T13:00:00` (string)

          Date (or datetime) in ISO8601 format.

          To specify range, separate start and end date(time) by comma: `...&since=2016-08-14,2016-08-15T13:42:57`

+     + _distinct_on: `scenario` (string, optional)

+ 	- The value can be any `key` in `Result.data`.  Example: `...&_distinct_on=scenario`

+ 	- Multiple values can be provided, separate by comma. Example: `...&_distinct_on=scenario,item`

  

  + Request `.../results/latest?item=koschei-1.7.2-1.fc24&type=koji_build`

      + Parameters

file modified
+87 -45
@@ -26,6 +26,8 @@ 

  from flask_restful import reqparse

  

  from sqlalchemy.orm import exc as orm_exc

+ from sqlalchemy import distinct

+ from sqlalchemy.sql import text

  from werkzeug.exceptions import HTTPException

  from werkzeug.exceptions import BadRequest as JSONBadRequest

  
@@ -319,22 +321,7 @@ 

  # =============================================================================

  #                                     RESULTS

  # =============================================================================

- 

- def select_results(since_start=None, since_end=None, outcomes=None, groups=None, testcases=None, testcases_like=None, result_data=None, _sort=None):

-     # Checks if the sort parameter specified in the request is valid before querying.

-     # Sorts by submit_time in a descending order if the sort parameter is absent or invalid.

-     query_sorted = False

-     if _sort:

-         sort_match = re.match(r'^(?P<order>asc|desc):(?P<column>.+)$', _sort)

-         if sort_match:

-             if sort_match.group('column') == 'submit_time':

-                 sort_order = {'asc': db.asc, 'desc': db.desc}[sort_match.group('order')]

-                 sort_column = getattr(Result, sort_match.group('column'))

-                 q = db.session.query(Result).order_by(sort_order(sort_column))

-                 query_sorted = True

-     if not query_sorted:

-         q = db.session.query(Result).order_by(db.desc(Result.submit_time))

- 

+ def filter_results(q, since_start=None, since_end=None, outcomes=None, groups=None, testcases=None, testcases_like=None, result_data=None):

      # Time constraints

      if since_start:

          q = q.filter(Result.submit_time >= since_start)
@@ -388,6 +375,26 @@ 

      return q

  

  

+ def select_results(since_start=None, since_end=None, outcomes=None, groups=None, testcases=None, testcases_like=None, result_data=None, _sort=None):

+     # Checks if the sort parameter specified in the request is valid before querying.

+     # Sorts by submit_time in a descending order if the sort parameter is absent or invalid.

+     query_sorted = False

+     if _sort:

+         sort_match = re.match(r'^(?P<order>asc|desc):(?P<column>.+)$', _sort)

+         if sort_match:

+             if sort_match.group('column') == 'submit_time':

+                 sort_order = {'asc': db.asc, 'desc': db.desc}[sort_match.group('order')]

+                 sort_column = getattr(Result, sort_match.group('column'))

+                 q = db.session.query(Result).order_by(sort_order(sort_column))

+                 query_sorted = True

+     if not query_sorted:

+         q = db.session.query(Result).order_by(db.desc(Result.submit_time))

+ 

+     q = filter_results(q, since_start, since_end, outcomes, groups, testcases, testcases_like, result_data)

+ 

+     return q

+ 

+ 

  def __get_results_parse_args():

      retval = {"args": None, "error": None, "result_data": None}

      try:
@@ -417,6 +424,7 @@ 

      args['testcases'] = [tc.strip() for tc in args['testcases'].split(',') if tc.strip()]

      args['testcases:like'] = [tc.strip() for tc in args['testcases:like'].split(',') if tc.strip()]

      args['groups'] = [group.strip() for group in args['groups'].split(',') if group.strip()]

+     args['_distinct_on'] = [_distinct_on.strip() for _distinct_on in args['_distinct_on'].split(',') if _distinct_on.strip()]

      retval['args'] = args

  

      # find results_data with the query parameters
@@ -456,39 +464,72 @@ 

          return p['error']

  

      args = p['args']

+     if not args['_distinct_on']:

+         q = select_results(

+             since_start=args['since']['start'],

+             since_end=args['since']['end'],

+             groups=args['groups'],

+             testcases=args['testcases'],

+             testcases_like=args['testcases:like'],

+             result_data=p['result_data'],

+             _sort=args['_sort'],

+         )

  

-     q = select_results(

-         since_start=args['since']['start'],

-         since_end=args['since']['end'],

-         groups=args['groups'],

-         testcases=args['testcases'],

-         testcases_like=args['testcases:like'],

-         result_data=p['result_data'],

-         _sort=args['_sort'],

-     )

- 

-     # Produce a subquery with the same filter criteria as above *except*

-     # test case name, which we group by and join on.

-     sq = select_results(

-         since_start=args['since']['start'],

-         since_end=args['since']['end'],

-         groups=args['groups'],

-         result_data=p['result_data'],

-         )\

-         .order_by(None)\

-         .with_entities(

-             Result.testcase_name.label('testcase_name'),

-             db.func.max(Result.submit_time).label('max_submit_time'))\

-         .group_by(Result.testcase_name)\

-         .subquery()

-     q = q.join(sq, db.and_(Result.testcase_name == sq.c.testcase_name,

-                            Result.submit_time == sq.c.max_submit_time))

+         # Produce a subquery with the same filter criteria as above *except*

+         # test case name, which we group by and join on.

+         sq = select_results(

+             since_start=args['since']['start'],

+             since_end=args['since']['end'],

+             groups=args['groups'],

+             result_data=p['result_data'],

+             )\

+             .order_by(None)\

+             .with_entities(

+                 Result.testcase_name.label('testcase_name'),

+                 db.func.max(Result.submit_time).label('max_submit_time'))\

+             .group_by(Result.testcase_name)\

+             .subquery()

+         q = q.join(sq, db.and_(Result.testcase_name == sq.c.testcase_name,

+                                Result.submit_time == sq.c.max_submit_time))

+ 

+         results = q.all()

+ 

+         return jsonify(dict(

+             data=[SERIALIZE(o) for o in results],

+         ))

+ 

+     testcases = args.get('testcases', None)

+     testcases_like = args.get('testcases:like', None)

+     since_start = args['since'].get('start', None)

+     since_end = args['since'].get('end', None)

+     groups = args.get('groups', None)

+     distinct_on = args['_distinct_on'] 

+ 

+     if not any([testcases, testcases_like, since_start, since_end, groups, p['result_data']]):

+         return jsonify({'message': ("Please, provide at least one "

+                                     "filter beside '_distinct_on'")}), 400

+ 

+     q = db.session.query(Result)

+     q = filter_results(q, since_start=since_start, since_end=since_end,

+                        groups=groups, testcases=testcases,

+                        testcases_like=testcases_like, result_data=p['result_data'])

+ 

+     values_distinct_on = ['result.testcase_name']

+     for i in distinct_on:

+         name = 'result_data_{}'.format(i)

+         alias = db.aliased(

+                 db.session.query(ResultData).filter(ResultData.key == i).subquery(), name=name)

+         q = q.outerjoin(alias)

+         values_distinct_on.append('{}.value'.format(name))

+     q = q.distinct(*values_distinct_on).order_by(

+             text(', '.join(values_distinct_on)), db.desc(text('result.submit_time')))

  

      results = q.all()

- 

-     return jsonify(dict(

+     results = dict(

          data=[SERIALIZE(o) for o in results],

-     ))

+     )

+     results['data'] = sorted(results['data'], key=lambda x: x['submit_time'], reverse=True)

+     return jsonify(results)

  

  

  RP['get_results'] = reqparse.RequestParser()
@@ -498,6 +539,7 @@ 

  RP['get_results'].add_argument('outcome', location='args')

  RP['get_results'].add_argument('groups', default="", location='args')

  RP['get_results'].add_argument('_sort', default="", location='args')

+ RP['get_results'].add_argument('_distinct_on', default="", location='args')

  # TODO - can this be done any better?

  RP['get_results'].add_argument('testcases', default="", location='args')

  RP['get_results'].add_argument('testcases:like', default="", location='args')

@@ -857,6 +857,117 @@ 

          assert data['data'][1]['testcase']['name'] == self.ref_testcase_name

          assert data['data'][1]['outcome'] == "FAILED"

  

+     def test_get_results_latest_distinct_on(self):

+         self.helper_create_testcase()

+ 

+         self.helper_create_result(outcome="PASSED", data={'scenario': 'scenario1'}, testcase=self.ref_testcase_name)

+         self.helper_create_result(outcome="PASSED", data={'scenario': 'scenario2'}, testcase=self.ref_testcase_name)

+         r = self.app.get('/api/v2.0/results/latest?testcases=' + self.ref_testcase_name + '&_distinct_on=scenario')

+         data = json.loads(r.data)

+         assert len(data['data']) == 2

+         assert data['data'][0]['data']['scenario'][0] == 'scenario2'

+         assert data['data'][1]['data']['scenario'][0] == 'scenario1'

+         r = self.app.get('/api/v2.0/results/latest?testcases=' + self.ref_testcase_name)

+         data = json.loads(r.data)

+         assert len(data['data']) == 1

+         assert data['data'][0]['data']['scenario'][0] == 'scenario2'

+ 

+     def test_get_results_latest_distinct_on_more_specific_cases(self):

+         '''

+             | id | testcase | scenario |

+             |----|----------|----------|

+             | 1  | tc_1     | s_1      |

+             | 2  | tc_2     | s_1      |

+             | 3  | tc_2     | s_2      |

+             | 4  | tc_3     |          |

+         '''

+         self.helper_create_result(outcome="PASSED", data={

+             'item': 'grub',

+             'scenario': 's_1'}, testcase='tc_1')

+         self.helper_create_result(outcome="PASSED", data={

+             'item': 'grub',

+             'scenario': 's_1'}, testcase='tc_2')

+         self.helper_create_result(outcome="PASSED", data={

+             'item': 'grub',

+             'scenario': 's_2'}, testcase='tc_2')

+         self.helper_create_result(outcome="PASSED", data={

+             'item': 'grub'}, testcase='tc_3')

+         r = self.app.get('/api/v2.0/results/latest?item=grub&_distinct_on=scenario')

+         data = json.loads(r.data)

+         assert len(data['data']) == 4

+         for i, result in enumerate(reversed(data['data'])):

+             assert result['id'] == (i+1)

+ 

+         r = self.app.get('/api/v2.0/results/latest?item=grub')

+         data = json.loads(r.data)

+         assert len(data['data']) == 3

+         assert data['data'][0]['id'] == 4

+         assert data['data'][1]['id'] == 3

+         assert data['data'][2]['id'] == 1

+ 

+         '''

+             | id | testcase | scenario |

+             |----|----------|----------|

+             | 1  | tc_1     | s_1      |

+             | 2  | tc_2     | s_1      |

+             | 3  | tc_2     | s_2      |

+             | 4  | tc_3     |          |

+             | 5  | tc_1     |          |

+         '''

+         self.helper_create_result(outcome="PASSED", data={

+             'item': 'grub'}, testcase='tc_1')

+         r = self.app.get('/api/v2.0/results/latest?item=grub&_distinct_on=scenario')

+         data = json.loads(r.data)

+         assert len(data['data']) == 5

+         for i, result in enumerate(reversed(data['data'])):

+             assert result['id'] == (i+1)

+ 

+         r = self.app.get('/api/v2.0/results/latest?item=grub')

+         data = json.loads(r.data)

+         assert len(data['data']) == 3

+         assert data['data'][0]['id'] == 5

+         assert data['data'][1]['id'] == 4

+         assert data['data'][2]['id'] == 3

+ 

+         '''

+             | id | testcase | scenario |

+             |----|----------|----------|

+             | 1  | tc_1     | s_1      |

+             | 2  | tc_2     | s_1      |

+             | 3  | tc_2     | s_2      |

+             | 4  | tc_3     |          |

+             | 5  | tc_1     |          |

+             | 6  | tc_1     | s_1      |

+         '''

+         self.helper_create_result(outcome="PASSED", data={

+             'item': 'grub', 'scenario': 's_1'}, testcase='tc_1')

+         r = self.app.get('/api/v2.0/results/latest?item=grub&_distinct_on=scenario')

+         data = json.loads(r.data)

+         assert len(data['data']) == 5

I'm getting AssertionError: assert 6 == 5 here.

+         for i, result in enumerate(reversed(data['data'])):

+             assert result['id'] == (i+2) # 2, 3, 4, 5, 6

+         r = self.app.get('/api/v2.0/results/latest?item=grub')

+         data = json.loads(r.data)

+         assert len(data['data']) == 3

+         assert data['data'][0]['id'] == 6

+         assert data['data'][1]['id'] == 4

+         assert data['data'][2]['id'] == 3

+ 

+     def test_get_results_latest_distinct_on_with_scenario_not_defined(self):

+         self.helper_create_testcase()

+     

+         self.helper_create_result(outcome="PASSED", testcase=self.ref_testcase_name)

+         self.helper_create_result(outcome="PASSED", testcase=self.ref_testcase_name)

+         r = self.app.get('/api/v2.0/results/latest?testcases=' + self.ref_testcase_name + '&_distinct_on=scenario')

+         data = json.loads(r.data)

+         assert len(data['data']) == 2

+ 

+     def test_get_results_latest_distinct_on_wrong_params(self):

+         r = self.app.get('/api/v2.0/results/latest?_distinct_on=scenario')

+         data = json.loads(r.data)

+         assert r.status_code == 400

+         assert data['message'] == "Please, provide at least one filter beside '_distinct_on'"

+ 

      def test_message_publication(self):

          self.helper_create_result()

          plugin = resultsdb.messaging.DummyPlugin

The results/latest endpoint groups the results by testcase. Sometimes
it might be useful to group by additional fields in the ResultData
table. This new version of the API endpoint gets an additional
parameter "group_by" that takes a list of fields (separated by ",")
on which will be done the group by.
This new feature is not available in case no other filtering is
provided. In that case the API returns an 400 error, since it wouldn't
be possible to retrieve the requested results with an efficient request.

Cool, I gave it a quick read-through, and it looks OK. I'll try to test it out early next week.

I only have couple of nitpicks, ATM, so I'll share those in the review.

Thanks!

Oh! I totally forgot about the sorting..... please wait for the review.

@jskladan I don't see any information about sorting the results in the API documentation. And in the code it seems to me that the sorting is only for submit_time. Is that correct? Should I allow the user to sort for additional fields? We don't really care about it, only the submit_time.

rebased onto 2ad798b6368a1672864fbdf8af84903b83b26807

6 years ago

I've put the sorting on the submit_time. Please let me know if some custom sorting is needed.

@gnaponie I honestly don't even remember, why I implemented the _sort field, but it is in the /latest only so one can use the same exact filtering params as with the /results endpoint. Non necessarily making much sense there, though, and we most certainly don't use it (AFAIK) anywhere in our stack.

Especially given the use-case of being interested in the "latest results", I don't think it's necessary to act on it in the /latest endpoints group_by scenario, and I'll maybe even change the code to ignore it alltogether in the /latest, as it is kind of a special citizen.

Hopefully that answered your questions?

Could you please do this
instead of: if <cond>: .................................... else: ...
the other way around: if not <cond>: .. and then the rest of code unindented?

Maybe create a separate test for this negative test case, what do you think?

From __get_results_parse_args the result_data is either None, or dict. So it should be possible to use if p.get('result_data') == None or p['result_data'].get('group_by', None) == None:
Is it right?

@ jskladan it does. Thank you.
So we can consider this PR complete. Looking forward for a review.
Thanks

UPDATE: This is even simpler:if not (p.get('result_data') and p['result_data'].get('group_by'):

At least these three:
since_start=args['since']['start'], since_end=args['since']['end'], groups=args['groups'],
seem to be re-used further below in the code. I suggest having these as a somedict defined somewhere soon. Then just append the dict to each call where it is used, as **somedict.

Is it reasonable?

Short form: distinct_on += (alias.value)

@gnaponie Awesome. The biggest change I'd propose ATM is renaming the URL param from group_by to _distinct_on - It IMO better represents the nature of what it does. I'll go through the code shortly, to add some comments, and will test the performance later in the week.

I'm really sorry, this absolutely is my fault here, but he RP['get_results_latest'] request parser is not really used in the get_results_latest(): method, but rather the RP['get_results'] is (is you can see by the first line in get_results_latests() (p = __get_results_parse_args()). And it is the reason why you need to fish the group_by from the result_data later on, instead of being able to use args['group_by'] in the same fashion args['testcases'], args['groups'] and so on get used in the method's body.

Please feel free to remove the whole RP['get_results_latest'] section, and modify the RP['get_results'] instead.

I'd also recommend chaging the group_by to _group_by (or _distinct_on, which I'd like a bit more), as we have the precedent of having reserved underscore-prepended URL params.

Please see my comment at line 64 first.
When you make the proposed change, you'll be able to do if not args['group_by'] instead, here, and in all the other instances (I won't comment on each line separately)

While you absolutely are right, this could be done, I don't really see the reason to exchange saving four (?) lines, for less readable code.

In fact, the else is not even needed, since the if block ends in return, there could just be the un-intended return clause from the else block at the end of the method.

+1 to the any() though, great point!

The else block is not really needed here, as the if-clause above ends with a return statement. You can easily just remove this line and uninted the rest of the code

@jskladan I meant exactly the same by the rest of code unindented, but you probably wrote it more understandable. :)

I'd maybe add more asserts here or even check the impact on resulting data if group_by is used versus not used.

@fvivaldi /me got confused by the change from if <cond> to if not <cond> in your original comment :)

I'm not really sure what the type of alias.value is, but the whole line is in fact equal not to distinct_on += (alias.value) as @fvivaldi proposes, but to distinct_on += alias.value, because (alias.value) is not a one-member tuple, but an expression, that gets evaluated to alias.value's actual value.

So, depending on what alias.value really is inside, you either should do:
* distinct_on += alias.value if alias.value is a tuple
* distinct_on += (alias.value, ) (note the , before the closing bracket) if alias.value is not a tuple

Done with the review / recommendations. I'd prefer having the negative test case separated and have some other asserts checking the data in the positive test or even comparing the data when group_by is used or not. I think it's good to test the actual expected behavior of the group_by.

This else block could easily be unintended, as the if clause it belongs to ends with return statement, but as @fvivaldi proposed above, I'd rather see the "group_by filter is missing some friends" check and the error message moved above, and have the contents of the above if-block unintended after the check 'passes`

@fvivaldi - well, you actually are correct in what you said. Your original comment on the line really is the short version of what is written there.
My comment is more for @gnaponie, as he should decide/know what "is supposed to be right".

Because either he wants to add a one-member tuple, and then (alias.value, ) should be used, or the code works as is, and then the parentheses are pointless.

@gnaponie - I've added my inline comments. Let me just say once more, that I'd like the group_by parameter to be renamed to either _group_by or _distinct_on (which I think better describes what is going on). In either way, lets change it to have the underscore prefix, as there already is a precedent of using underscore prefixed params for "reserved" purposes (like _sort).

Also huge thanks, @fvivaldi for reviewing the code!

distinct_on += alias.value works just fine :) parenthesis are not needed.
PS. I'm a "she" :)

rebased onto 17f5fc06045dad639e1b92e4713e317ee656452c

6 years ago

I've addressed all the comments. Thank you both for the reviews.
But I still want to check on thing...

Does this need to be in the loop body?

Does this need to be in the loop body?

Ah, never mind, just read the docs about aliased.

Can you also add a result without the screnario key to ensure it works properly?

The query doesn't work in case there's no "scenario" field in ResultData... (it returns empty data... that is not correct). I'm working on fixing that.

I think it's correct that a query with a _distinct_on clause should not return results that don't have that field.

@mikeb - I don't agree. That would, once again, mean that the consumer has to know specific details about the data stored in resultsdb, in order to do separate latest queries for different subsets of testcases to get the complete dataset.

My expected behaviour of _distinct_on is that results having the 'distinctive field'(s) set are separated, but those not having the 'distinctive field'(s) are still returned in the dataset. Maybe to reword it a bit better - having all other parameters the same, a query with distinct_on set is a superset (not subset) of a query without it. That was the initial point anyway - missing results in /latest because they get "too aggregated".

So, to be more specific, given this result dataset (let's imagine all of these share the same item, let's say grub, and the higher the id, the more recent the result is):

| id | testcase | scenario |
|----|----------|----------|
| 1  | tc_1     |  s_1     |
| 2  | tc_2     | s_1      |
| 3  | tc_2     | s_2      |
| 4  | tc_3     |          |

I would expect that /latest/item=grub&_distinct_on=scenario returns results 1, 2, 3, 4, while /latest/item=grub returns 1, 3, 4.

Having a maybe a bit more complicated scenario:

| id | testcase | scenario |
|----|----------|----------|
| 1  | tc_1     |  s_1     |
| 2  | tc_2     | s_1      |
| 3  | tc_2     | s_2      |
| 4  | tc_3     |          |
| 5  | tc_1     |          |

Expected results would be: /latest/item=grub&_distinct_on=scenario returns resuls 1, 2, 3, 4, 5 (scenario s_1 and scenario None are different), while /latest/item=grub returns 3, 4, 5.

And obviously:

| id | testcase | scenario |
|----|----------|----------|
| 1  | tc_1     |  s_1     |
| 2  | tc_2     | s_1      |
| 3  | tc_2     | s_2      |
| 4  | tc_3     |          |
| 5  | tc_1     |          |
| 6  | tc_1     |  s_1     |

/latest/item=grub&_distinct_on=scenario returns resuls 2, 3, 4, 5, 6 (6 replaces 1), while /latest/item=grub returns 3, 4, 6.

@jskladan It was pointed out to me that Fedora already has policy that checks for a null scenario, so what you say above makes sense.

@lholecek yes, as long as the goal is to perform multiple joins on the same table. it IMO is debatable, whether it needs to be implemented this specific way. If anything, the db.and_ on the line below is not needed (as there is only one condition to be and-ed) and the whole for-cycle may well be equivalent to:
alias = db.aliased(ResultData)
q = q.join(alias).filter(alias.key.in_(values_distinct_on)

but I'm not sure how the q.distinct really works here, so I'd need to test it.

After playing around with it, the answer is yes this is correct, as you need to include all the "distictive" values in the selected row. the db.and_ IMO is unnecessary, but that's about it.

I'd like to see some more complex scenarios here. Maybe get some inspiration from: https://pagure.io/taskotron/resultsdb/pull-request/129#comment-78366 ? Thanks!

I have a query that would work. I need to translate it to sqlalchemy and then update the PR :)

@jskladan It was pointed out to me that Fedora already has policy that checks for a null scenario, so what you say above makes sense.

The query doesn't work in case there's no "scenario" field in ResultData... (it returns empty data... that is not correct). I'm working on fixing that.

My guess would be that left outer join (instead of the defaul left inner join) on the _distinct_on params would be the way to go (so the 'missing' params would be still present in the join, but set to NULL) but as I said, I'm not that great of a SQL expert.

I spent some time playing around with this, and I'd like this section to be rewritten into far more readable (and also IMO producing nicer query) version:

     values_distinct_on = [Result.testcase_name]
     for i in distinct_on:
         alias = db.aliased(ResultData)
         q = q.join(alias).filter(alias.key == i)
         values_distinct_on.append(alias.value)
     q = q.distinct(*values_distinct_on)

I don't really feel that sorting the results here is necessary.

Looking more at the whole query selection, I see that you skip the "sort by submit_time in descending order" part of the data selection. I'm pretty certain that is not a right thing to do, because than the dataset is traversed in undefined order, thus the result returned by the query might not necessarily be the real latest one.
Excerpt from the official documentaion:

DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. [...] Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. [...] The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s).

I guess that you skipped the "sort" because you were then receiving errors like the one below, but this is not the right solution.

Looking more at the whole query selection, I see that you skip the "sort by submit_time in descending order" part of the data selection. I'm fairly certain that is not a right thing to do, because than the dataset is traversed in undefined order, thus the result returned by the query might not necessarily be the real latest one.
Excerpt from the official documentaion:
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) SELECT DISTINCT ON expressions must match initial ORDER BY expressions

That's my query: https://paste.fedoraproject.org/paste/6joxDODs8o7GiYoWC0MtxQ
it's been hours I'm trying to translate that in sqlalchemy. sqlalchemy--

I have no idea how to make "left join" that's not "outer left join" with sqlalchemy...

@gnaponie Looks good. One slight nitpick - I'd rather see that sorted by result.submit_time than result.id, but that's an easy one.

Ad sqlalchemy joins - I'm pretty certain the default behaviour of both JOIN in SQL and .join() in sqlalchemy is an INNER join so you are all set there. On the other hand, if you wanted to use OUTER join (as you do here LEFT JOIN (SELECT * FROM result_data WHERE result_data.key = 'scenario') as result_data_2, as LEFT JOIN in postgresql is equivalent of LEFT OUTER JOIN), use .outerjoin() instead of join().

Hope that helped!

rebased onto 8553f5e3b8c12ad1917717bef74d40698b7126f1

6 years ago

I've rebased fixing the query and adding the test @lholecek requested.
It should be complete now.

I'd still like to see more "complex" test-scenarios here. Have a look at https://pagure.io/taskotron/resultsdb/pull-request/129#comment-78366 for inspiration - testing only the "most simple" paths gets the coverage, but I'd like to have the tests covering the "intent" of the behaviour. Thanks!

I think these could easily be moved to the filter_results call, since none of these gets re-used in any other place.

Please invert the condition here (to not any(...)), and have the body of the if-clause contain the error message (so - check for the "wrong" condition, and return error in reaction to it being "wrong"). Move the "correct path" outside of the if block.

Looks good overall, apart of the minor "issues" mentioned inline.
I'll try to have this tested on a reasonable dataset until the end of the week.
Thanks again for the awesome work!

I think I am OK with the review fixes. Maybe extending the tests, as proposed, would be good.

@jskladan I'm working on adding some more tests as you suggested. Unfortunately I've noticed that if I make a "manual" test the behavior is the expected one (with postgres). Writing a test that will run on a sqlite database (the same exact test as the manual one) won't succeed because sqlite treats "DISTINCT ON" differently (you should use GROUP BY in sqlite).
That's basically the issue: https://stackoverflow.com/questions/17223174/returning-distinct-rows-in-sqlalchemy-with-sqlite
But I couldn't apply the proposed patch in the first comment of the stackoverflow question. Because TL;DR***

Summary: can I skip the troubled test or do you have an idea on how to test that as it would be on Postgres? Unit tests with mock? Ideas?

TL;DR***: Because it will ask me to group_by also by result.id (that would of course change the behavior), but it is needed because it is in the select to connect the 2 tables: Result and ResultData.

rebased onto 26d1b8994ab4c26176e426854fb2148f79ba1e4b

6 years ago

I'm not sure I understand this suggestion. These are needed in the "if".

rebased onto b49994b6e5d467a5e34600b515cf29be00426af9

6 years ago

I've added the suggested test, besides the last 2, because, as I wrote yesterday, I couldn't add those tests (because the tests run with sqlite, and the code uses DISTINCT ON, that's a postgres thing).

@gnaponie I'm not sure which "last two" you mean, but please add as thorough testing as possible, this IMO is quite an important feature, and I'd like to have the tests ready.

Implement the "problematic" tests as if postgres was available, and I'll come up with a way of making this all work on build time, and such.

Thanks!

rebased onto ef9559a5d5d68f37ebe883443f1ba627c879d584

6 years ago

Ok! Done.
So at this point the last 2 "assert blocks" of test_get_results_latest_distinct_on_more_specific_cases will fail if you run the tests with sqlite database. They should run successfully with postgresql.

Awesome, thanks again for keeping up with me!

rebased onto dfae5391c2e3650cf0bb949cb76dcb76338aeeee

6 years ago

I'm getting AssertionError: assert 6 == 5 here.

@gnaponie Then maybe skip those tests when using sqlite for tests?

@gnaponie Then maybe skip those tests when using sqlite for tests?

...please see this comment above: https://pagure.io/taskotron/resultsdb/pull-request/129#comment-79609

@gnaponie OK, I've read both comments before but forgot about them. :)

Not sure if it's good idea to merge something that breaks tests. Would be nice to use postgres by default but AFAIK setting it up is still a bit too much work (docker helps but it's not trivial).

Would be nice to use postgres by default but AFAIK setting it up is still a bit too much work (docker helps but it's not trivial).

I think @jskladan is working on that.

@jskladan, have you had a chance to sort out running postgresql for the unit tests? Please, let us know if there's something we can do on our end to speed up getting this change in. Thanks!

I'm working with @frantisekz on a solution that allows reasonable testability on dev, and during build-time. I'm working on a solution ATM, I expect we could merge to develop wednesday at the latest. Thanks for the offer, though @lucarval. If anything comes up, I'll hold you to it :)

@gnaponie - playing around with the unit tests, I'm getting weird, random errors, and after quite some time, I'm fairly certain they have to do with a fact that you are missing sort() on the query altogether. I already commented on the issue once, but I guess you missed it, so reposting:

Looking more at the whole query selection, I see that you skip the "sort by submit_time in descending order" part of the data selection. I'm pretty certain that is not a right thing to do, because than the dataset is traversed in undefined order, thus the result returned by the query might not necessarily be the real latest one.
Excerpt from the official documentaion:

DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. [...] Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. [...] The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s).

Please make sure the query is getting sorted (in descending order) by submit_time. Othervise, there is no guarantee the query is returning correct data. Although it probably works on small dataset (which probably is why the unittests passed when you tried them locally), it is not correct to ignore sorting.

@gnaponie - this is a reproducer that I came up with, to show that the sorting matters:

def test_distinct_on_sorting(self):
    self.helper_create_testcase()     
    self.helper_create_result(outcome="PASSED", testcase=self.ref_testcase_name)                                                                                                                               
    self.helper_create_result(outcome="FAILED", testcase=self.ref_testcase_name)
    r = self.app.get('/api/v2.0/results?testcases=' + self.ref_testcase_name)                                                                                                                                  
    data = json.loads(r.data)
    assert len(data['data']) == 2 
    assert data['data'][0]['outcome'] == 'FAILED'
    assert data['data'][1]['outcome'] == 'PASSED'
    assert data['data'][0]['submit_time'] > data['data'][1]['submit_time']
    r = self.app.get('/api/v2.0/results/latest?testcases=' + self.ref_testcase_name + '&_distinct_on=scenario')
    data = json.loads(r.data)
    assert len(data['data']) == 1
    assert data['data'][0]['outcome'] == 'FAILED'

When I run the test against postgresql backend, it almost always fails on the last assert.

rebased onto 0c25d54

6 years ago

Sorry! I must have totally forget about the comment...
Could you please try to run the tests again? I think it should be fine now.
Thank you!!

PS. do you want me to add also that test?

Merged!

Thank you all for the awesome work. I'm glad for all the help and enthusiasm!

Pull-Request has been closed by jskladan

6 years ago

Great! Thank you.
Do you know when the deploy will be started?

@gnaponie I made a release, pushed it to f31/f30 for now. I expect It'll be deployed on our dev and stg instances next week, but we won't be able to deploy it on prod before it's moved from F27 to F29, so I can't give you an eta just yet.

@frantisekz could you ping @gnaponie and either @cverna or I when stg is updated?

For production we can start to plan for after the beta-freeze :)

@frantisekz Have you had a chance to update stg? How does it look for prod?

@pingou Yeah, I did that last week, I've pinged somebody on irc ( it was either @gnaponie or @cverna ).

As for production, we currently plan to bring our servers to F29 (and that means also upgrading to resultsdb containing this PR) on Tuesday next week.

@pingou Yeah, I did that last week, I've pinged somebody on irc ( it was either @gnaponie or @cverna ).
As for production, we currently plan to bring our servers to F29 (and that means also upgrading to resultsdb containing this PR) on Tuesday next week.

awesome, could you let me know once it's in prod? :)

@pingou Yeah, I did that last week, I've pinged somebody on irc ( it was either @gnaponie or @cverna ).
As for production, we currently plan to bring our servers to F29 (and that means also upgrading to resultsdb containing this PR) on Tuesday next week.

awesome, could you let me know once it's in prod? :)

Sure!