#47606 replica init/bulk import errors should be more verbose
Closed: wontfix. Opened 7 years ago by rmeggins.

For example - attempting to do repl init with a very large 100MB entry - the messages in the errors and access logs are not informative, and the user has no idea how to proceed e.g. increase nsslapd-maxbersize (but how large? there is no information in the access or errors log), increase nsslapd-cachememsize, increase nsslapd-import-cachesize (but again, there is no information about how much to increase these).


Replying to [ticket:47606 rmeggins]:

For example - attempting to do repl init with a very large 100MB entry - the messages in the errors and access logs are not informative, and the user has no idea how to proceed e.g. increase nsslapd-maxbersize (but how large? there is no information in the access or errors log), increase nsslapd-cachememsize, increase nsslapd-import-cachesize (but again, there is no information about how much to increase these).

In terms of the too-large-ber error, the error is returned from openldap's ber_get_next:
{{{
tagp = ber_get_next( conn->c_sb, &bytes_scanned, ber );
}}}
It returns LBER_OVERFLOW, which is assigned to
tagp; the returned bytes_scanned is 0, errno is set to ERANGE, and the returned ber is almost empty.
{{{
p ber
$9 = {ber_opts = {lbo_valid = 2, lbo_options = 0, lbo_debug = 0}, ber_tag = 48, ber_len = 42237,
ber_usertag = 0, ber_buf = 0x0, ber_ptr = 0x7f265e9920c7 "", ber_end = 0x0, ber_sos_ptr = 0x0,
ber_rwptr = 0x7f265e9920c7 "", ber_memctx = 0x0}
}}}
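The caller-side detection of this condition can be sketched as follows. This is a self-contained mock, not real liblber code: the stub stands in for ber_get_next, and the LBER_OVERFLOW value mirrors what OpenLDAP's lber.h defines (an assumption worth re-checking against the header you build with).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mock of the relevant lber types/constants. The values mirror
 * OpenLDAP's lber.h (LBER_OVERFLOW is ((ber_tag_t)-3) there), but
 * this is a stand-alone sketch, not the real library. */
typedef unsigned long ber_tag_t;
typedef unsigned long ber_len_t;
#define LBER_OVERFLOW ((ber_tag_t)(-3))

typedef struct mock_ber {
    ber_len_t ber_len; /* length parsed so far, even on overflow */
} MockBerElement;

/* Stub standing in for ber_get_next(): simulate a ber whose declared
 * length exceeds maxbersize, reproducing the behavior seen above. */
static ber_tag_t
mock_ber_get_next(ber_len_t declared_len, ber_len_t maxbersize,
                  ber_len_t *bytes_scanned, MockBerElement *ber)
{
    if (declared_len > maxbersize) {
        *bytes_scanned = 0;        /* nothing consumed on overflow */
        ber->ber_len = declared_len;
        errno = ERANGE;            /* as observed in the ticket */
        return LBER_OVERFLOW;
    }
    *bytes_scanned = declared_len;
    ber->ber_len = declared_len;
    return 0x30;                   /* LDAPMessage SEQUENCE tag */
}

/* Caller-side check: the overflow case is the tag plus errno pair. */
static int
is_ber_too_big(ber_tag_t tag)
{
    return (tag == LBER_OVERFLOW && errno == ERANGE);
}
```

The numbers in the test below (42237 vs. a 20480-byte maxbersize) match the gdb dump above.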
Here's the comment when ber is too large in get_next_from_buffer.
{{{
/*
 * openldap does not differentiate between length == 0
 * and length > max - all we know is that there was a
 * problem with the length - assume too big */
err = SLAPD_DISCONNECT_BER_TOO_BIG;
}}}
Since the entry size is not available in get_next_from_buffer, it just logs this minimal error message in the errors log:
{{{
connection - conn=4 fd=65 Incoming BER Element was too long, max allowable is 20480 bytes. Change the nsslapd-maxbersize attribute in cn=config to increase.
}}}
Is there any way to get the entry size when this fails?

There may be enough information in the ber to use ber_peek_tag(BerElement *ber, ber_len_t *len) to get the len.
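Since no public lber API exposes ber_len after a failure, one workaround is to mirror the leading fields of the private BerElement layout (as seen in the gdb dump above) and read ber_len through a cast. This is a sketch only: the layout below is hand-copied from the debug output, not from a public header, and may change between openldap versions.

```c
#include <assert.h>

/* Hand-copied mirror of the leading fields of OpenLDAP's private
 * struct berelement, matching the gdb dump in this ticket.
 * ASSUMPTION: the real layout agrees; this is not a public API and
 * may change in future lber releases. */
struct mirrored_ber {
    struct {
        short lbo_valid;
        unsigned short lbo_options;
        int lbo_debug;
    } ber_opts;
    unsigned long ber_tag;
    unsigned long ber_len;  /* the size we want, even on overflow */
    /* ... remaining fields omitted; we never touch them ... */
};

/* Read the parsed length out of an opaque BerElement pointer. */
static unsigned long
peek_ber_len(const void *opaque_ber)
{
    return ((const struct mirrored_ber *)opaque_ber)->ber_len;
}
```

With the values from the dump (ber_tag = 48, ber_len = 42237), the helper recovers the length the public API refuses to return.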

Description:
1. maxbersize: If the size of an entry is larger than the consumer's
maxbersize, the following error used to be logged:
  Incoming BER Element was too long, max allowable is ### bytes.
  Change the nsslapd-maxbersize attribute in cn=config to increase.
This message does not indicate how large maxbersize needs to be.
This patch adds code to retrieve the size of the failed ber.
Revised message:
  Incoming BER Element was @@@ bytes, max allowable is ### bytes.
  Change the nsslapd-maxbersize attribute in cn=config to increase.
Note: There is no lber API that returns the ber size when it fails
to handle the ber. This patch borrows the internal structure of the
ber and gets the size from it. This could be risky, since the size
or layout of the ber structure could change in the openldap lber
library.
2. cache size: The bulk import depends upon the nsslapd-cachememsize
value in the backend instance entry (e.g., cn=userRoot,cn=ldbm
database,cn=plugins,cn=config). If an entry is larger than
cachememsize, the bulk import used to fail with this message:
  import userRoot: REASON: entry too large (@@@ bytes) for the
  import buffer size (### bytes). Try increasing
  nsslapd-cachememsize.
The message follows the skipping-entry message:
  import userRoot: WARNING: skipping entry "<DN>"
but in fact it did NOT "skip" the entry and continue the bulk
import; it failed there and completely wiped out the backend
database.
This patch changes the message as follows:
  import userRoot: REASON: entry too large (@@@ bytes) for the
  effective import buffer size (### bytes). Try increasing
  nsslapd-cachememsize for the backend instance "userRoot".
and, as the message says, it now skips the failed entry and
continues the bulk import.
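The size check behind the revised message can be sketched as follows. Function and parameter names are illustrative stand-ins, not the real identifiers from the 389-ds-base import pipeline; the message text matches the revised wording above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the check behind the revised message: if an
 * entry exceeds the effective import buffer derived from
 * nsslapd-cachememsize, report it with enough detail to act on.
 * The caller then decides whether to skip the entry or abort. */
static int
check_entry_fits(size_t entry_size, size_t import_buffer_size,
                 const char *backend, char *msg, size_t msglen)
{
    if (entry_size > import_buffer_size) {
        snprintf(msg, msglen,
                 "import %s: REASON: entry too large (%zu bytes) for "
                 "the effective import buffer size (%zu bytes). Try "
                 "increasing nsslapd-cachememsize for the backend "
                 "instance \"%s\".",
                 backend, entry_size, import_buffer_size, backend);
        return -1;
    }
    return 0;
}
```

The point of naming the backend instance in the message is that a server can host several backends, each with its own nsslapd-cachememsize.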

ber_peek_tag doesn't work? or ber_peek_element?

So the way bulk import/repl init works with this patch is that entries will be skipped? If entries are skipped, will the overall operation report success? I'm not sure that's a good thing. If someone does a repl init, it will report success, and they will think everything is fine, only to find out (much?) later that everything is not fine.

I think import should fail hard as soon as there is a problem. If we clearly tell people what's wrong, and how to fix it, that should be sufficient.

git patch file (master) -- revised: including mozldap ber, as well.
0001-Ticket-47606-replica-init-bulk-import-errors-should-.2.patch

Looks good, but I don't think we should skip entries during replica init - I think it should be a hard failure. The user may not know there were problems. The consumer errors log may fill up with thousands of error messages.

Replying to [comment:10 rmeggins]:

Looks good, but I don't think we should skip entries during replica init - I think it should be a hard failure. The user may not know there were problems. The consumer errors log may fill up with thousands of error messages.

Yes, that's true... My concern: while I was debugging this bug and the bulk import failed, the consumer's DB was completely wiped out, and any attempt to recover the db was quite tricky and painful -- I had to remove the consumer's replica, reinitialize the db using a suffix ldif file, then add the replication setup back to the consumer. It might confuse customers, as well...

side note:
http://www.openldap.org/its/index.cgi
Logged in as guest
Viewing Incoming/7764
Date: Fri, 13 Dec 2013 21:42:03 +0000
From: nhosoi@gmail.com
To: openldap-its@OpenLDAP.org
Subject: RFE: library lber method which returns the ber size even if the ber is overflown

Full_Name: Noriko Hosoi
Version: 2.4.35-4
OS: Fedora 18
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (209.132.181.86)

We use the OpenLDAP library in our software. LDAP clients can send a
too-large ber, causing LBER_OVERFLOW (or LBER_DEFAULT) to be returned
from the lber APIs. We would like to learn from the error how large a
ber we should be prepared to handle. When we inspect the BerElement
with gdb, ber->ber_len stores the value, but the value is not returned
to the caller when the API fails, e.g., on overflow. Would it be
possible to have a way to retrieve the ber size under any condition?

Replying to [comment:11 nhosoi]:

Replying to [comment:10 rmeggins]:

Looks good, but I don't think we should skip entries during replica init - I think it should be a hard failure. The user may not know there were problems. The consumer errors log may fill up with thousands of error messages.

Yes, that's true... My concern: while I was debugging this bug and the bulk import failed, the consumer's DB was completely wiped out, and any attempt to recover the db was quite tricky and painful -- I had to remove the consumer's replica, reinitialize the db using a suffix ldif file, then add the replication setup back to the consumer. It might confuse customers, as well...

In any case, when bulk import fails, the database is wiped out. That's the point of bulk import/replica init - it wipes out your database and replaces it with the one you specify. But I'm not sure why you had to reconfigure the replica - usually when replica init fails you just retry it, you don't have to reconfigure. That sounds like another bug.

Thanks for your comments, Rich.

The code that skipped a failed entry instead of failing in bulk_import_queue was backed out.

And this is why repl5_tot_waitfor_async_results did not return with
"done" when a failure was returned from the connection.
3. In repl5_tot_result_threadmain, when conn_read_result_ex returns
non-zero (non SUCCESS), it sets abort, but does not set any error
code in rc (the return code), so the state is not considered
"finished" in repl5_tot_waitfor_async_results and it continues
waiting until the code reaches the max loop count (about 5 minutes).
This patch sets LDAP_CONNECT_ERROR as the return code along with
setting abort, if conn_read_result_ex returns CONN_NOT_CONNECTED.
This makes the bulk import finish quickly when it fails.
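Point 3 can be sketched as the result-handling logic below. The CONN_* values are stand-ins (the real ones come from 389-ds-base headers); LDAP_CONNECT_ERROR is 0x5b in <ldap.h>.

```c
#include <assert.h>

/* Stand-in constants; real values come from 389-ds-base headers and
 * <ldap.h> (where LDAP_CONNECT_ERROR is 0x5b). */
#define CONN_OPERATION_SUCCESS 0
#define CONN_NOT_CONNECTED     2
#define LDAP_CONNECT_ERROR     0x5b

/* Sketch of the fix described in point 3: on a connection failure,
 * set both the abort flag AND an error return code, so the waiter
 * (repl5_tot_waitfor_async_results in the real code) sees a
 * "finished" state instead of looping for ~5 minutes. */
static void
handle_read_result(int conn_result, int *abort_flag, int *rc)
{
    if (conn_result != CONN_OPERATION_SUCCESS) {
        *abort_flag = 1;
        if (conn_result == CONN_NOT_CONNECTED) {
            *rc = LDAP_CONNECT_ERROR; /* lets the waiter finish now */
        }
    }
}
```

Before the fix, only the abort flag was set, which the waiter did not treat as a terminal state; setting rc as well is what makes the failure propagate immediately.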

Reviewed by Rich (Thank you!!)

Pushed to master:
97cf45d..1119083 master -> master
commit 1119083

Pushed to 389-ds-base-1.3.2:
4ea2fde..fde9ed5 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit fde9ed5

Pushed to 389-ds-base-1.3.1:
ae3daa1..c9d0b6c 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit c9d0b6c

Metadata Update from @nhosoi:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.3.1.17

4 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/943


Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

5 months ago
