#134 RHEL7 Postgres doesn't support jsonb
Closed: Fixed 6 years ago Opened 6 years ago by puiterwijk.

Recent versions of waiverdb use the jsonb column type (and #129 advocates using it more), but the RHEL7 Postgres is 9.2, which is older than 9.4, the version in which jsonb was added.
As a consequence, the new version doesn't work with the RHEL7 Postgres, and can't currently be deployed in Fedora:

sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) type "jsonb" does not exist        
 [SQL: 'ALTER TABLE waiver ALTER COLUMN subject TYPE JSONB USING subject::jsonb']               
--> pre: Failed                                 
error: Pre hook failed: the pre hook failed: , aborting rollout of waiverdb/waiverdb-web-3      

Oh... :-( I should have checked this more carefully. I saw that it did support the 'json' column type and assumed that meant it also supported 'jsonb'. :-(

Unfortunately it's not as simple as just switching from JSONB to the older JSON column type. The JSON column type does not have an equality operator defined, which we need in order to select back waivers matching a particular subject...

We can make Postgres cast the JSON values to TEXT and compare/index based on the cast value... but it is not clear to me whether two equivalent but differently serialized JSON objects, like:

{"type": "koji_build", "item": "python-requests-1.2.3-1.fc26"}

and

{"item": "python-requests-1.2.3-1.fc26", "type": "koji_build"}

would erroneously compare unequal after casting. I guess it depends on whether Postgres does any normalization, or whether it guarantees a stable ordering for keys in JSON objects, which I imagine it might not...
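A quick illustration of the client side of the problem (assuming Python 3.7+, where dicts preserve insertion order, json.dumps reproduces whatever order the keys were built in; on older Pythons the order instead varies between interpreter runs due to hash randomisation, which is even less predictable):

>>> import json
>>> json.dumps({"type": "koji_build", "item": "python-requests-1.2.3-1.fc26"})
'{"type": "koji_build", "item": "python-requests-1.2.3-1.fc26"}'
>>> json.dumps({"item": "python-requests-1.2.3-1.fc26", "type": "koji_build"})
'{"item": "python-requests-1.2.3-1.fc26", "type": "koji_build"}'

Equal objects, two different strings.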

So I dug into this a tiny bit more, testing against the RHEL7 Postgres 9.2. The Postgres JSON column type is not much more than a thin wrapper around a TEXT column, with some operations (like equality comparison) not allowed and some other operators (like indexing into JSON objects) added.

Importantly, it seems that Postgres does no normalization on the incoming JSON values; it just stores them in exactly the same encoded form in which they arrive over the DBAPI. The value ultimately comes from a dict on which SQLAlchemy has called json.dumps().

Because the waivers are accepted in POST request bodies as JSON, and then decoded from there into standard Python dicts, we absolutely cannot assume any consistent ordering of keys in the object; in fact we can be certain it will vary, due to Python's hash randomisation.

We had this problem before #129 (even on SQLite when using TEXT columns as a fallback) but we just didn't realise it.

There are unfortunately no JSON conversion functions in Postgres 9.2 which we could use to preprocess and normalize the JSON values either.

The only options I can see are:

  • Make Waiverdb use a custom JSON serializer as the values are going into the database, to ensure that the whitespace and key ordering in the serialized JSON is always consistent. Then we can do indexing/comparison using a cast to TEXT. We just have to make sure that it stays consistent forevermore, and that no values are inserted into the database via other means that aren't similarly normalized.

  • Find a way to use JSONB, which solves all this mess for us.

Importantly, it seems that Postgres does no normalization on the incoming JSON values, it just stores them in exactly the same encoded form that they arrive over the DBAPI.

I forgot to explain the importance of this: it means that when you cast back to TEXT for comparison, the comparison is sensitive to however the JSON happened to be serialized on the way in.

waiverdb=# select '{ }'::json::text = '{}'::json::text;
 ?column? 
----------
 f
(1 row)

Unlike JSONB, which does what you expect:

waiverdb=# select '{"a":1,"b":2}'::jsonb = '{"b": 2, "a": 1}'::jsonb;
 ?column? 
----------
 t
(1 row)

@puiterwijk do you know if there are any plans in Fedora infra to switch to a newer Postgres? For example the 9.5 or 9.6 version provided in RHEL7 software collections?

I remember reading about Postgres 9.4 being deployed in stage for BDR quite a long time ago, but it seems that never eventuated?

@dcallagh We do not use software collections in Fedora Infrastructure, so unless we go with BDR, we'll be staying with Postgres 9.2 for the foreseeable future.

About BDR, the problem is that most applications just weren't ready for it (there are a few limitations you need to keep in mind when using it), and the developers didn't have the time to commit to it.
So if you want to be the first, please make sure your code works well with BDR, and then we can look further into deploying it in production; right now we don't have (enough) buy-in from developers.

Also, I'm curious: is there any reason why you need to store this data as JSON in the database?
It's relational data, in a relational database.
You could just "expand" the fields into separate tables/rows and use the kinds of queries we've been writing in SQL for many years.

To go with your example: {"type": "koji_build", "item": "python-requests-1.2.3-1.fc26"}.

How about:

waiver
-------
id: int
result_id: int
subject: foreign key <subject.id>
...

subject
-----------
id: int
type: varchar (255)
item: varchar (255)

Or even:

subject_field
-------------------
subject_id: int
field_name: varchar(255)
field_value: varchar(255)

With an entry: INSERT INTO subject_field VALUES (<someid>, 'type', 'rpm');

Fun fact about doing it without JSON(B): I'd guess this would make lookups by subject vastly faster, since you can actually use things like primary keys and indexes, and you only have to store the data once.

@puiterwijk: I think the reason he wants to use jsonb for that field is that he uses the JSON field as a unique key identifying a row in the table.

It is not generally known in advance what will be stored in that JSON, but he wants to be able to find out whether the table already contains a record with the same JSON key/value pairs.

He can do this in a few ways, as he already explained:
a) Create his own serializer/parser for that field to ensure the JSON is stored as text in a well-defined form (keys sorted, no extra whitespace, and so on), so that string-level comparison works.
b) Use JSONB, which makes this work out of the box.

Using an extra table as you proposed would make this harder, but not impossible. It depends, however, on whether he needs to store nested JSON (like a JSON dict inside a JSON dict) or just key/value pairs. Also, the SELECT would be really ugly, I presume, because it would have to compare all the key/value pairs in that extra table. At first view I don't like this solution a lot.
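To give a feel for how ugly: below is a hypothetical sketch in SQLAlchemy Core (the table definition follows @puiterwijk's proposal above; the function name and exact-match semantics are my own invention). Matching a whole subject amounts to a relational-division query built from two grouped subselects:

from sqlalchemy import (Column, Integer, MetaData, String, Table, func,
                        select, tuple_)

metadata = MetaData()

# The proposed key/value table, as sketched above.
subject_field = Table(
    'subject_field', metadata,
    Column('subject_id', Integer),
    Column('field_name', String(255)),
    Column('field_value', String(255)),
)

def subjects_matching_exactly(subject):
    """Select ids of subjects whose stored pairs are exactly subject.items()."""
    pairs = list(subject.items())
    # Subjects which have every requested key/value pair...
    has_all = (
        select([subject_field.c.subject_id])
        .where(tuple_(subject_field.c.field_name,
                      subject_field.c.field_value).in_(pairs))
        .group_by(subject_field.c.subject_id)
        .having(func.count() == len(pairs)))
    # ...intersected with subjects which have no extra pairs beyond those.
    right_size = (
        select([subject_field.c.subject_id])
        .group_by(subject_field.c.subject_id)
        .having(func.count() == len(pairs)))
    return has_all.intersect(right_size)

Workable, but a long way from a simple WHERE subject = :subject.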

@jkaluza Note that SQLAlchemy can help a lot with my suggestion, and if you just spend the time to write a small wrapper function, it can be fully automatic and transparent...

@dcallagh could you perhaps convert the dictionary to an OrderedDict (ordered by keys) and then run json.dumps on it before storing it in the database? Then you'd be guaranteed to have the same formatting and key order. You'd likely have to do a database migration that applies this to all existing entries, though.

@dcallagh I talked a little with @ralph and we thought: why don't we go back to the first solution we had [1]?
Unfortunately, it got lost in all the rebases.

[1] https://avacariu.me/articles/2016/compiling-json-as-text-for-sqlite-with-sqlalchemy

@gnaponie: That would be possible, with the caveat @mprahl mentioned: you need to ensure the key/value pairs are always ordered the same way, so you would have to use an OrderedDict instead of the default Python dict.

Yeah, rather than sorting the dict yourself before dumping, you can get the JSON encoder to sort the keys for you; json.encoder.JSONEncoder takes sort_keys:

>>> import json.encoder
>>> encoder = json.encoder.JSONEncoder(sort_keys=True)
>>> encoder.encode(dict(foo='bar'))
'{"foo": "bar"}'

Metadata Update from @dcallagh:
- Issue assigned to dcallagh

6 years ago

Okay. I'm not sure what the big deal is against using software collections. The macros are unpleasant to write in a spec file, but consuming the RHEL7 software collections is quite painless: you just yum install a different set of packages, and it gives you a different systemd service to start, but it's otherwise like a regular Postgres install.

It would be neat if we could have a second database in Fedora infra, say pg96db01.phx2.fedoraproject.org, which ran RHEL7 with Postgres 9.6 from Software Collections. And then applications could gradually migrate over to that when it suited them, until eventually there are none left on db01. And then repeat the process with Postgres 10 on RHEL8 when that comes out in a few years :-)

Anyway... Assuming we are stuck with Postgres 9.2 and we therefore can't use JSONB column types, our only other option is to ensure predictable serialization of the JSON values we put in.

This technique https://avacariu.me/articles/2016/compiling-json-as-text-for-sqlite-with-sqlalchemy is pretty much the same as sqlalchemy_utils.JSONType. The newer SQLAlchemy native JSON type is basically equivalent to that too, except that it also knows about the Postgres JSONB operators.

But none of those techniques guarantees that the same JSON structure going in will produce the exact same text when stored in the database. And we need to guarantee that, if we are going to compare entire JSON objects for equality as we want to do here. I guess most applications which store JSON then query out certain keys from it, rather than doing equality comparisons on the entire objects.

JSONEncoder(sort_keys=True) seems like it should be enough to guarantee consistent serialization, so I will try writing a custom SQLAlchemy type which uses that, instead of plain json.dumps. Nice find, thanks Ralph.
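Roughly this shape, I mean: a minimal sketch (not the actual patch; the type name is invented here) of a TypeDecorator over Text that always serializes with sorted keys:

import json

from sqlalchemy.types import Text, TypeDecorator

class StableJSON(TypeDecorator):
    """Store JSON as text, serialized with sorted keys, so the stored
    representation is stable and safe to compare/index via ::text casts."""

    impl = Text

    _encoder = json.JSONEncoder(sort_keys=True)

    def process_bind_param(self, value, dialect):
        # Serialize consistently on the way in.
        return None if value is None else self._encoder.encode(value)

    def process_result_value(self, value, dialect):
        # Decode back to Python structures on the way out.
        return None if value is None else json.loads(value)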

Oh, and yeah, of course we would store this as a proper structured table if we had a fixed set of keys. But it is not "item" and "type" every time, as @jkaluza pointed out: the API accepts any arbitrary set of key-values for the subject. (I am increasingly convinced that is a bad design, and in future I expect we will change it as part of greenwave#126.)

Note to self, also index and compare USING subject::text

Do note that if it's plain key/value pairs, you could use the subject_field approach... It just won't work with recursive key/values, so it would be good to make sure you really need that before ruling the option out entirely.

For compactness, @dcallagh, try JSONEncoder(sort_keys=True, separators=(',', ':')). :neutral_face:

I would probably keep the default separators, to match the rows that will already be in the database...

Although I guess there is nothing to guarantee their keys are currently sorted either. So this will really need a migration to make sure all the existing values are serialized in the same consistent way.
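Such a migration could be as simple as re-serializing every row in place. A hypothetical sketch, assuming an Alembic-style migration over the existing waiver.subject column (this is not the actual PR code):

import json

import sqlalchemy as sa
from alembic import op
from sqlalchemy.sql import column, table

def upgrade():
    conn = op.get_bind()
    waiver = table('waiver',
                   column('id', sa.Integer),
                   column('subject', sa.Text))
    rows = conn.execute(sa.select([waiver.c.id, waiver.c.subject])).fetchall()
    for id_, subject in rows:
        # Round-trip through the encoder to get sorted keys and uniform spacing.
        normalized = json.dumps(json.loads(subject), sort_keys=True)
        conn.execute(waiver.update()
                     .where(waiver.c.id == id_)
                     .values(subject=normalized))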

Well, it's not pretty: PR#135.

I could also try a different approach: a custom SQLAlchemy column type which emits the proper CAST() expressions, so that we don't need to remember to do it at every point in the code where we compare or GROUP BY on the subject column.
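Something along these lines, perhaps (a hedged sketch with invented names, not the PR#136 code): override the type's comparator so that equality is emitted as a comparison of text casts on both sides.

from sqlalchemy import cast
from sqlalchemy.types import Text, TypeDecorator

class ComparableJSON(TypeDecorator):
    """JSON-as-text type whose equality comparisons cast both sides to TEXT,
    so callers never have to remember to write subject::text themselves."""

    impl = Text

    class comparator_factory(TypeDecorator.Comparator):
        def __eq__(self, other):
            # Emit CAST(col AS TEXT) = CAST(:param AS TEXT).
            return cast(self.expr, Text) == cast(other, Text)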

I've tested the migration process by hand only. I still want to write regression tests for the migrations, but haven't done that yet.

Alternative implementation using a custom SQLAlchemy type is PR#136. It's also nasty in slightly different ways. Pick your poison I guess.

It might have been buried somewhere, but is using the RHSCL Postgres version not viable?

@sochotni, Patrick wrote:

@dcallagh We do not use software collections in Fedora Infrastructure, so unless we go with BDR, we'll be staying with Postgres 9.2 for the foreseeable future.

Yeah, I was schooled on this during standup :-) Oh well..

We went with PR#136. Tagged as Waiverdb 0.9.

Metadata Update from @dcallagh:
- Issue close_status updated to: Fixed
- Issue set to the milestone: 0.9
- Issue status updated to: Closed (was: Open)

6 years ago
