#7391 FreeIPA Status / Notification page for listing actionable events and their solutions.
Closed: duplicate 5 years ago by rcritten. Opened 6 years ago by tomk.

Request for enhancement

Consider adding a FreeIPA Status / Notification page listing actionable events that might break an existing configuration. A solutions page linked from that Status / Notification page would also be welcome.

Issue

There is currently no easy way to determine whether a given logged issue has broken IPA.

For example, if replication is broken between two masters, there is no way to know this from the UI unless the user knows exactly what to look for in the log files. A tab with red, amber or green notifications would help surface any broken configuration. (Possibly even include alerting.)
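As an illustration of what an admin has to do by hand today, the agreement status can be read directly out of 389-ds; a rough sketch (simple bind as Directory Manager, nothing IPA-specific assumed):

```
# Query replication agreement health directly from 389-ds.
# A non-zero nsds5replicaLastUpdateStatus usually means the agreement
# is broken: exactly what a red/amber/green tab could surface.
ldapsearch -x -D "cn=Directory Manager" -W \
    -b "cn=mapping tree,cn=config" \
    "(objectClass=nsds5replicationAgreement)" \
    nsds5replicaLastUpdateStatus nsds5replicaLastUpdateEnd
```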

There is currently no way to see from within FreeIPA whether all the AD users were correctly mapped. After running ipa group-add-member, it's unclear whether the operation completed fully and which users / groups were actually added. This matters when, for example, a given group turns out to be empty, which is why certain defined sudo rules do not work for a user. The User Group section currently does not list any AD users.
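For illustration, the current workflow only confirms the mapping through the "External member" attribute on the group; a sketch with placeholder group and AD user names:

```
# Map an AD user into an external group (names are placeholders).
ipa group-add-member ad_admins_external --external 'AD\jsmith'
# The "External member" line in the full output is the only visible
# confirmation that the mapping was stored.
ipa group-show ad_admins_external --all
```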

A third nice-to-have would be the ability to see an AD user's or group's properties from within FreeIPA after it has been mapped through external groups.
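Today the closest equivalent is resolving the AD entry through SSSD on an IPA server from the command line, e.g. (placeholder user name):

```
# Resolve an AD user via SSSD to inspect UID/GID and group memberships;
# none of this is visible from the FreeIPA UI after external mapping.
id jsmith@ad.example.com
getent passwd jsmith@ad.example.com
```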

Steps to Reproduce

  1. N/A (RFE = Request For Enhancement)

Actual behavior

RFE

Expected behavior

RFE

Version/Release/Distribution

Currently using the following versions:

```
$ rpm -q freeipa-server freeipa-client ipa-server ipa-client 389-ds-base pki-ca krb5-server
package freeipa-server is not installed
package freeipa-client is not installed
ipa-server-4.5.0-22.el7.centos.x86_64
ipa-client-4.5.0-22.el7.centos.x86_64
389-ds-base-1.3.6.1-24.el7_4.x86_64
pki-ca-10.4.1-17.el7_4.noarch
krb5-server-1.15.1-8.el7.x86_64
```

Additional info:

N/A

Log file locations: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Linux_Domain_Identity_Authentication_and_Policy_Guide/config-files-logs.html
Troubleshooting guide: https://www.freeipa.org/page/Troubleshooting


Metadata Update from @rcritten:
- Issue set to the milestone: FreeIPA 4.8

6 years ago

@rcritten this also seems to have some overlap with #4008.

See also https://pagure.io/freeipa/issue/5829 ([RFE] topology analysis tool), which seems to have some overlap (especially with the "architectural review" feature, if implemented).

I'm going to close this as a duplicate of the others.

Metadata Update from @rcritten:
- Issue close_status updated to: duplicate
- Issue status updated to: Closed (was: Open)

5 years ago

Hi there, I noticed many other tickets about this are closed, but they relate to https://www.freeipa.org/page/V4/Healthcheck, which was shown to me by @mreynolds. CC @rcritten @dpal for visibility.

I really like the intent of this tool, but given my experience at Red Hat I think the approach to the design may be overlooking something really important. It looks like you are trying to build a healthcheck tool that checks all the things you think are important, but those things aren't necessarily important to a system implementor or to support.

Honestly, the best way to approach this is to engage GSS (Red Hat's Global Support Services): get them all in a meeting and collect the top 5 most common classes of errors. Then automate fixes for those cases and make those processes more robust; only where that is not possible should the healthcheck tool be written to detect those specific cases, giving GSS a quicker diagnosis so they know how to follow up.

This is going to give you a huge return on the investment of engineering time: it helps GSS diagnose issues quickly, it finds the issues deployments actually care about (focused on IPA specifics), and in many cases, because you focused on automated solutions, it will even reduce the support case load!

For example, you could spend a day writing disk monitoring into the tool, but no one will care because Nagios will do it better, and it doesn't actually resolve any GSS cases.

A more complete case study of this is in Directory Server. When I was employed by Red Hat I would arrange meetings with GSS independently to ask about common issues. Around 2016 the top issue was performance, specifically directory tuning, affecting both IPA and DS. GSS informed me that most customers didn't understand the required dbcache/entrycache ratio, and there were many complaints of IPA/DS being slow while we defaulted to 10MB of cache on install. I then wrote an automatic tuning tool that detects memory limits and scales the server with appropriate values, on the assumption that IPA runs on the same machine. After this was released and servers updated, GSS reported that it was not only easier to help customers (run this script, restart the instance, performance fixed), but that NEW installs were never generating calls in the first place. GSS for IdM then saw a significant drop in case load related to DS performance as a result of this work.
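For context, that tuning surfaces in 389-ds as the cache auto-sizing settings; a hedged way for a reader to check whether it is active on an instance:

```
# Inspect the 389-ds cache auto-sizing configuration.
# nsslapd-cache-autosize > 0 means caches are sized from available
# memory instead of the old 10MB static default.
ldapsearch -x -D "cn=Directory Manager" -W \
    -b "cn=config,cn=ldbm database,cn=plugins,cn=config" \
    nsslapd-cache-autosize nsslapd-dbcachesize
```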

So my advice: if you want to write a healthcheck tool, focus on engaging with GSS, automate fixes for the issues they report, and ONLY where you can't automate a fix, add it to the healthcheck so that intervention and diagnosis can be improved. A healthcheck tool is not a replacement for fixing weak processes (e.g. cert renewal); it should only exist where an automated fix can't be deployed (for example, weak DS ACIs, because we don't have the contextual business knowledge to know what the admin's intent was).

My concern, and the reason I give this advice, is that the current healthcheck design looks like it isn't engaging with the key stakeholders (GSS), and is more about reporting issues than about automatically fixing them and making the processes in the server more robust. That is a short-term approach rather than long-term engineering.

Hope that advice helps on the approach.

> For example, you could spend a day writing disk monitoring into the tool, but no one will care because Nagios will do it better, and it doesn't actually resolve any GSS cases.

What I forgot to say here is that you could also write monitoring of the CA renewal case, but that's not fixing the problem, it's just making diagnosis quicker. You're spending a lot of effort to see the problem faster, but that doesn't help, because you need to prevent the problem occurring at all; it's already trivially easy to detect on a system. Don't think "how can we fix it after it explodes", think "how can it never explode at all".
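To the point that detection is already trivial: certmonger already tracks every IPA certificate, so the renewal state is one command away today:

```
# Any tracked request not in the MONITORING state (e.g. CA_UNREACHABLE)
# already signals a renewal problem, without a new tool.
getcert list | grep -E "Request ID|status|expires"
```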

@firstyear the support org was involved in developing the Health Check requirements. We also continue to work on the robustness of the system, to avoid problems occurring in the first place. I agree that prevention is better than cure (we all do), but prevention, diagnosis and treatment are all needed. Health Check is about diagnosis, and possibly, in the future, about automatic remediation where possible.
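For reference, the tool being discussed ships as ipa-healthcheck; a minimal run on an IPA server (assuming the freeipa-healthcheck package is installed):

```
# Run all registered checks and report only the failures (JSON output).
ipa-healthcheck --failures-only
```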

@ftweedal That's not how the design reads, though. I think the design should explicitly state:

- What the support org's concerns are
- What can be fixed directly
- What can't be fixed, which becomes the healthcheck

I think there are a lot of "hidden" design decisions in this case that aren't captured in the upstream design, so community members like me are not able to see that process, which is why it certainly looks as though customer support was not included.
