#51156 Add failover credentials to replication agreement
Closed: wontfix 3 years ago by spichugi. Opened 3 years ago by mreynolds.

Issue Description

Using a Bind DN Group for a replicated suffix opens up the possibility that the group gets out of sync, and then replication breaks. For example, suppose you are using a Bind DN Group with GSSAPI authentication. If the Kerberos credentials change for a member of the group, that member will fail to authenticate to the remote replica. It's a chicken-and-egg problem: we need to replicate the credential update, but we cannot authenticate in order to replicate that change.

I'm proposing adding a failover account/credentials to an agreement.

nsDS5ReplicaFailoverBindDN: REPL_MANAGER_DN
nsDS5ReplicaFailoverCredentials: PASSWORD
nsDS5ReplicaFailoverBindMethod: SIMPLE
nsDS5ReplicaFailoverTransportInfo: LDAPS

So at the start of each replication session, if we fail to bind with the default credentials, we can fall back to "cn=replication manager,cn=config", for example. In the scenario above, after this failover bind the server will be back in sync, and on the next session it will go back to trying the default credentials.
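
For illustration, a sketch of an agreement carrying the proposed attributes might look like the following (the agreement name, host, and DNs are made up, and the failover attribute names are only this proposal, not an existing schema):

 dn: cn=to-replica2,cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config
 objectClass: top
 objectClass: nsds5replicationagreement
 cn: to-replica2
 nsDS5ReplicaRoot: dc=example,dc=com
 nsDS5ReplicaHost: replica2.example.com
 nsDS5ReplicaPort: 389
 nsDS5ReplicaBindMethod: SASL/GSSAPI
 nsDS5ReplicaTransportInfo: LDAP
 # Proposed failover attributes, only tried when the primary bind fails:
 nsDS5ReplicaFailoverBindDN: cn=replication manager,cn=config
 nsDS5ReplicaFailoverCredentials: PASSWORD
 nsDS5ReplicaFailoverBindMethod: SIMPLE
 nsDS5ReplicaFailoverTransportInfo: LDAPS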


Metadata Update from @mreynolds:
- Custom field origin adjusted to IPA
- Custom field reviewstatus adjusted to None
- Issue priority set to: major
- Issue tagged with: RFE, Replication

3 years ago

I know your intent is good here @mreynolds but the problem with this is that failover really means "two sets of valid credentials". It's not so much failover as "hey both of these work". And if you are in a situation where GSSAPI is failing all the time but the simple bind method always works, why not use simple binds and TLS? It's faster, more secure, and honestly better in every way. It's simpler to manage and simpler to debug. Using GSSAPI here adds nothing to the system's security except satisfying an obsession with using GSSAPI everywhere. And having "two sets" of working credentials now means you have twice the surface area to understand, debug, and secure. Even better, when GSSAPI is failing, simple bind will just silently work in the background and no one will be any the wiser.

IMO the flaw here is that FreeIPA uses GSSAPI for replication, rather than nominating passwords on their replica accounts and using simple binds instead. If you use the replication manager in lib389, it creates service accounts and simple passwords for the topology during the join/unjoin steps (rather than a globally identical cn=replication manager,cn=config).
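
As a rough sketch of that model (the container and account DNs below are purely illustrative), each replica gets its own account with its own random password, and the consumers trust the group:

 # One account per replica, each with its own randomly generated password
 dn: uid=replica-a,cn=replication managers,cn=sysaccounts,cn=etc,dc=example,dc=com
 objectClass: top
 objectClass: account
 objectClass: simpleSecurityObject
 uid: replica-a
 userPassword: random-secret-unique-to-replica-a

 # The group the consumers already point at via nsDS5ReplicaBindDNgroup
 dn: cn=replication managers,cn=sysaccounts,cn=etc,dc=example,dc=com
 objectClass: top
 objectClass: groupOfNames
 cn: replication managers
 member: uid=replica-a,cn=replication managers,cn=sysaccounts,cn=etc,dc=example,dc=com
 member: uid=replica-b,cn=replication managers,cn=sysaccounts,cn=etc,dc=example,dc=com

Revoking one replica is then just removing its member value or resetting that one password.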

So I know you are trying to help and fix a real issue for FreeIPA, but this is a problem of their own making that they need to resolve, and we need to guide and support them to understand this. It's a poor use of our time to add this when more robust, simpler, and better solutions already exist.

PS: you can use the existing IPA replica accounts in the DIT as replication accounts, which is what IPA already does, just with a simple password instead of GSSAPI, and this would literally fix this entire situation, make backup/restore easier, etc. All that IPA needs to do is randomly generate a replica account password and put it into their agreements.

Some things to consider: there are security-minded customers that cannot have any passwords (even hashed ones) stored in a file. However, this could be configurable, or we could use the server cert as the failover?

Why use server-cert as failover when you could just use the server cert instead of GSSAPI?

From the security standpoint, Kerberos credentials expire and renew more often than the TLS certificate. We would need to consider the likelihood of MITM attacks or other vulnerabilities when using a TLS cert vs a Kerberos TGT vs a user/password stored in files.

From the standpoint of replication conflicts, both 389-ds and IPA get them, but IPA seems to get them more often. Is there something different in how 389-ds handles them vs IPA?

From the security standpoint, Kerberos credentials expire and renew more often than the TLS certificate. We would need to consider the likelihood of MITM attacks or other vulnerabilities when using a TLS cert vs a Kerberos TGT vs a user/password stored in files.

No, the sessions expire, but the server key material stays the same. And GSSAPI has had less examination and hardening from a cryptographic perspective than the TLS libraries.

In fact, the whole use of GSSAPI weakens your system security because it necessitates the existence of a plaintext port, whereas without it you could run TLS only. Maybe in the past GSSAPI was equivalent in security, but not anymore. TLS protects billions of operations every day without being MITMed; GSSAPI is barely a blip by comparison. TLS is proven, and it is an invested-in and improving landscape. GSSAPI/KRB is stagnant by comparison.

TLS with username and password today is more secure than GSSAPI, and easier to administer and configure. Using server certs is just as good, and also easy to configure and administer. GSSAPI is notorious as one of the most complex, unusable, undebuggable systems to exist on the planet, and every IPA admin in the world would probably love you forever if you didn't use GSSAPI in the replication path. As would customer support, and developers, because they could finally understand what's happening in their system.

But this problem is 100% the creation of FreeIPA insisting on GSSAPI for everything, even when simpler solutions exist - because that's how AD does it, so they have to do the same. The moment we have a fallback mechanism, we'll silently fall back and use it all the time. We are adding more complexity so that FreeIPA can fail silently without upsetting people's ideologies. What needs to happen is for FreeIPA to take a look at itself and its own complexity, realise that this is the source of the issue, and accept that it's not up to us to fix it for them in this situation.

From the standpoint of replication conflicts, both 389-ds and IPA get them, but IPA seems to get them more often. Is there something different in how 389-ds handles them vs IPA?

Replication is the same between IPA and 389-ds. The issue is that IPA treats replication as "always consistent", so developers and admins assume that a change on one server is immediately visible on another. The difference is that DS deployments and admins tend to be simpler in configuration and have a better history of knowing that replication is eventually consistent, so you see things like "all writes go to one replica" configurations, rather than IPA's "ohh, we'll upgrade every master at once and then wonder why they all stomped on each other's replication".

As well, 389-ds is capable of read-only replication, so people configure it and understand that model better. You have, say, 4 writable replicas and then thousands of read-onlies, and it works at planet scale. The topology can converge on a consistent state much more quickly in this configuration.

FreeIPA is not capable of this, so you expand the read-write replication graph to huge sizes (20 to 60 nodes), which means that "eventually" takes even longer to converge, and that longer window is more time in which inconsistency can arise. And all because of the idealism of "oh no, we would have to re-key the KDC for read-only replication and we can't do that yet", rather than admitting that no one actually cares about cryptographic idealism attacks on a KDC in 2020 and allowing read-only replicas to exist, which would basically remove an entire class of problems in FreeIPA around horizontal scaling, replication conflicts, and more.

But FreeIPA is a project which is all about KRB, and 389-ds is a piece of the puzzle in the path of a few people's obsession with KRB, even when it's blatantly obvious that it's not the correct tool for the situation. For example, using GSSAPI to authenticate your replication mechanism creates a whole stack of chicken-and-egg problems and whole classes of issues that wouldn't exist if robustness and simplicity were prioritised instead.

Anyway, I've said my piece. I think this is 100% a waste of the 389-ds team's time, when the problem is clearly not ours. And I'm frankly tired of seeing issues like this. I empathise with Mark and the team having to spend their time on it, and I will review it if we decide to go ahead, but I strongly believe we should not be implementing this.

@firstyear you are raising very valid points regarding the benefits/drawbacks of using GSSAPI vs simple bind for a replication agreement. However, the ticket is not limited to a specific auth mechanism as the primary auth mechanism; it could also be a primary simple bind with a fallback simple bind (or GSSAPI).

IMHO this ticket is about how to avoid the risk of a host becoming isolated because of an auth failure. An admin can create multiple paths, but with many hosts that becomes complex and has drawbacks (like competition between the links). I see the failover auth as an easy way to mitigate that risk. Should we create replication agreement subentries to specify several failovers?

I agree the fallback will be silently used (but we can always log a warning), but it can be transient, as the primary auth will be tried again at the next session.

I know your intent is good here @mreynolds but the problem with this is that failover really means "two sets of valid credentials". It's not so much failover as "hey both of these work".

Correct, but this is already built into replication today. On the replica receiving end we allow two sets of managers via:

 nsDS5ReplicaBindDN: cn=replication manager,cn=config
 nsDS5ReplicaBindDNgroup: cn=replication managers, cn=sysaccounts, cn=etc, dc=example,dc=com

It's valid to have and use both, so having two sets of credentials in the agreement is not a stretch from the current design. In the IPA case, the fallback bind would typically only happen once, as it would allow the bind group to get updated. On the next replication session it would go back to using the primary auth method/bind group.

Typically we can only get into this situation with GSSAPI, but the problem is not specific to GSSAPI, it's specific to Bind DN Groups. Another problem with using Bind DN Groups is that you cannot do an online init if the group members are not already present on the replica. So fresh installations cannot use Bind DN Groups and online inits. It's a corner case with workarounds, but I'm just looking at this from the robustness POV.

This ticket was really just to discuss the options. Looks like IPA might have a fix for the current problem anyway, but that has not been confirmed yet.

I agree the fallback will be silently used (but we can always log a warning), but it can be transient, as the primary auth will be tried again at the next session.

The moment we add a failover and rely on "logged warnings", those are warnings people won't check or read, because the system works. It's better to fail fast and fail noisily, so that issues are investigated and resolved. This is a core element of reliable systems, rather than masking or hiding failures.

For a good example, look at SSSD a few years ago. It had so many issues with crashes and restarts that instead of fixing the problem (code quality, reliability, simplicity), they added the systemd automatic restart to mask those issues. Now if something goes wrong, it's a blip; people think it's a glitch and can't understand why it occasionally drops an event. It's then hard to isolate that SSSD crashing and restarting was the fault. Whereas if SSSD crashed and stayed down, it would immediately set off warning flags about the health of the application.

For a similar reason, I opposed automatic restarts of 389-ds a number of years ago when it was raised, and since then, through our team's focus on testing, use of ASAN, and more, we now have an extremely reliable, hardened system - where no one is asking for "silent restarts" or "hiding issues".

This is about more than just a technical solution to a technical problem; it's about the social and user impact of silently hiding faults in a system. And we should never do that.

Typically we can only get into this situation with GSSAPI, but the problem is not specific to GSSAPI, it's specific to Bind DN Groups. Another problem with using Bind DN Groups is that you cannot do an online init if the group members are not already present on the replica. So fresh installations cannot use Bind DN Groups and online inits. It's a corner case with workarounds, but I'm just looking at this from the robustness POV.

Yep, there is a chicken-and-egg issue here. Have a look at the lib389 replication manager code for bind dn groups and how it achieves this. It does an initial bootstrap with the replication manager, then purges it, and uses the bind dn group from then on. IIRC that's how all replication topologies in lib389 tests are allocated and managed now (unless someone changed it). And there are even ways to make this more reliable. You could have, say, a second backend for replication managers, or whatever.
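
A rough sketch of that bootstrap-then-purge sequence, written as plain ldapmodify operations against the consumer (the suffix and DNs are illustrative; lib389 does the equivalent through its own API):

 # 1. Temporarily allow the global replication manager on the consumer's replica config
 dn: cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config
 changetype: modify
 add: nsDS5ReplicaBindDN
 nsDS5ReplicaBindDN: cn=replication manager,cn=config

 # 2. Run the online init / replicate the bind dn group accounts over that session.

 # 3. Purge the bootstrap bind DN so only the group is trusted from then on
 dn: cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config
 changetype: modify
 delete: nsDS5ReplicaBindDN
 nsDS5ReplicaBindDN: cn=replication manager,cn=config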

The bigger fault is the use of GSSAPI, which adds no value in terms of security, but brings huge amounts of complexity into the environment, and as mentioned, is really really hard to debug and gain transparency and visibility into.

If this were a bind dn group with passwords, it would be trivial to fix if they (somehow) got out of sync: just reset the password on the replication agreements and the service account and, look, magic, it works. And there is no reason for it to even go out of sync in the first place, because as a static credential it doesn't need rotating.
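
For example, the whole repair would be two small modifies along these lines (the agreement and account DNs are only illustrative, matching the sketches earlier in the thread):

 # On the supplier: update the credential stored in the agreement
 dn: cn=to-replica2,cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config
 changetype: modify
 replace: nsDS5ReplicaCredentials
 nsDS5ReplicaCredentials: new-random-secret

 # In the DIT: reset the matching service account password to the same value
 dn: uid=replica-a,cn=replication managers,cn=sysaccounts,cn=etc,dc=example,dc=com
 changetype: modify
 replace: userPassword
 userPassword: new-random-secret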

If it were bind groups with TLS, again, it's easy to manage. You can see how the certmap works, you can test it easily, and you can much more easily debug and fix the certificates.

But repairing this with GSSAPI? I have no idea where to start. I guess maybe re-extract all the service keytabs and copy them around?

Now imagine you were a sysadmin. Not a 389 developer, with loads of LDAP and GSSAPI experience. A sysadmin. Which of these three would you want to debug and fix? Which of these could you understand?

I really, really think that we shouldn't pursue this line of thought. We need to spend more time educating FreeIPA that simple, robust, and reliable designs that are easier to manage are the more important way to stop this kind of situation from ever occurring: rather than needing to fix issues, the issues never occur in the first place. There seem to be a lot of FreeIPA-specific problems that don't happen in 389-ds, and they really come back to FreeIPA confusing "complex" with "good", rather than treating "correct, simple, and accessible" as "good".

I really think it's not our problem, and we shouldn't spend our time on this.

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4.4

3 years ago

I do not want to get into the discussion about using GSSAPI. We offer it, so we have to support it - end of story. But this is really about using bind dn groups...

I have recently discussed this issue with Thierry again, and we believe there is good value in doing this feature. I think it would be better to look at this as having "bootstrap credentials" for bind dn groups, and not as "failover credentials". We feel there are just no good reasons not to do this feature. Just because there are better ways to configure replication doesn't mean we should not support the "less than ideal" configurations. This feature will make the lives of our customers, consultants, and support folks much easier - that alone is the only justification we should need to pursue it.
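
To make the rename concrete, the attributes from the original proposal would read something like this under the "bootstrap" framing (the names below are only illustrative, not a final design):

 nsDS5ReplicaBootstrapBindDN: REPL_MANAGER_DN
 nsDS5ReplicaBootstrapCredentials: PASSWORD
 nsDS5ReplicaBootstrapBindMethod: SIMPLE
 nsDS5ReplicaBootstrapTransportInfo: LDAPS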

Also, this will be easy to implement - it's not some huge feature that will require a redesign or massive code changes to replication. It would be different if this were hard to code or very time-consuming to implement, but it's really quite simple to add this feature (I've already done all the code investigation). So while the points made by @firstyear are very valid, they are not enough, in our opinion, to justify not doing this. The pros simply outweigh the cons. We will just have to agree to disagree on this issue.

@firstyear, one of the valid points (IMHO) you raised was that it may hide an error condition, since we use a failover credential rather than the expected one. It is a valid point, but it is transient. The next session will use the expected credential. A warning when we have to fall back to the failover credential will prevent the error condition from being hidden.

I agree there exist other ways to make replication more resilient to an authentication failure (duplicate RAs, duplicated paths, moving the credential to a separate replicated backend, ...), but all of them look more complex and difficult to set up than this one.

@firstyear, one of the valid points (IMHO) you raised was that it may hide an error condition, since we use a failover credential rather than the expected one. It is a valid point, but it is transient. The next session will use the expected credential. A warning when we have to fall back to the failover credential will prevent the error condition from being hidden.

What I want to say is: don't think of this as a "failover credential". It's not. It's two equivalent sets of credentials. They are both capable of achieving the same operation. They both have the same high level of rights in the directory. And as a result they both need an equivalent level of securing. This completely changes the threat model and risk around replication in the topology from the current situation with bind dn groups.

So if we have one credential that is failing and one that always works, and they are equivalent, then why do we persist in using the bind dn group? Its value (isolating credentials per replica) becomes void once we have distributed a global shared replication secret, because they are equivalent credentials. Which means either there is no value in using the bind dn group in the first place, since a global shared secret was acceptable as a security risk; or the process of restoring the bind dn group credentials is flawed when a global shared secret is not acceptable as a security risk, which means an ephemeral solution is required rather than multiple sets of equivalent credentials.

We will just have to agree to disagree on this issue.

Yep :) My only "hard request" is that this feature can NOT be called a "failover" credential; it must be called multiple replica credentials, to highlight the fact that they are security equivalents in the directory and topology, and that all of the credentials must be considered in the security and risk model of the administrator/integrator.

I do not understand why the proposed solution adds any risk around replication. If a failover authentication succeeds, it is because the consumer accepts that authentication as valid for replication. The risk is taken by the consumer, which already accepts several credentials (groups or a list of bind DNs), including some failover ones. The risk is already accepted by the application that deployed the consumer in such a way.
What is not balanced is that, while the consumer accepts several auths, the supplier cannot select a failover auth that the consumer accepts.

The entire point of bind dn groups is for when you have a topology, let's say hosts A, B, C. Let's now say that host A is compromised (obviously we hope not, but it happens). Okay, we realise this, so we can revoke the credentials for host A. B and C can continue to work with their credentials, and everything is happy. (In fact, we only need to revoke the credential on one host, and the revocation flows out to all the others.)

Now, if this were a shared credential like the global replication manager? Well, if A is compromised, we now need to touch B and C and change their replication manager passwords to revoke it.

So what if this was 20 servers? 100?

And if it was like this ... maybe the admin will choose to "accept the risk" and not reset the global replication account password because it's "too hard". Which may lead to further attacks, etc.

And this threat could arise in many ways. An unhappy administrator? A poorly protected backup?

cn=replication manager,cn=config is hugely powerful. That single bind DN, like Directory Manager, can bypass ACIs. That's why being able to revoke that credential quickly and effectively matters.

This problem is about a risk and threat model in real deployments - about how to handle a post-compromise scenario.

Think of it this way: the replication manager is a shared secret on all hosts, whereas the bind group is per-server accounts. And that really does change the risk and threat model for deployments. Mixing these methods means the shared secret becomes "the weakest link".

Which is why I'm saying "if we accept failover, why not just use that in the first place"?

Thanks for this excellent example. Note that in your example revocation of A is simpler because it uses a centralized auth mechanism, while revocation of 'replication manager' is expensive because it is a local auth mechanism. The decision between centralized and local auth is application dependent, and considering your example, a secure application should choose the centralized mechanism.

Back to your question "if we accept failover, why not just use that in the first place", I would say because we want to get a warning (when using this failover auth) that the application is not running as expected.

Again, it is a decision (not related to this patch) of the consumer to accept several authentications and auth mechanisms. If some mechanism makes the application weaker, it can be improved with or without this patch.

Thanks for this excellent example. Note that in your example revocation of A is simpler because it uses a centralized auth mechanism, while revocation of 'replication manager' is expensive because it is a local auth mechanism. The decision between centralized and local auth is application dependent, and considering your example, a secure application should choose the centralized mechanism.

Exactly my point. That you have to choose based on your security needs and threat model.

That's why having "failover" or "equivalent" credentials means that you have a lowest-common-denominator risk model. In this case, if you have both the bind group AND the repl manager, then your security baseline is the repl manager account. This means you have TWO revocation scenarios, and all the benefits of the bind dn group don't exist because you have the repl manager too.

Back to your question "if we accept failover, why not just use that in the first place", I would say because we want to get a warning (when using this failover auth) that the application is not running as expected.
Again, it is a decision (not related to this patch) of the consumer to accept several authentications and auth mechanisms. If some mechanism makes the application weaker, it can be improved with or without this patch.

But we've had this issue before many times - people don't read warnings; they don't check them unless they are actively seeking the feedback. These errors will be silently ignored in the logs. So if the goal is to warn or raise an alert when something has failed, that's a monitoring issue, not an equivalent-credential issue, and we should improve our replication monitoring instead?

Metadata Update from @mreynolds:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1848359

3 years ago

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4.3 (was: 1.4.4)

3 years ago

Metadata Update from @mreynolds:
- Issue assigned to mreynolds

3 years ago

Metadata Update from @mreynolds:
- Assignee reset

3 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/4209

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix
- Issue status updated to: Closed (was: Open)

3 years ago
