#2001 logrotate needs fix on copr-be.aws.f.o
Closed: Fixed 2 years ago by praiskup. Opened 2 years ago by praiskup.

Today I was too busy to get to this, so I disabled logrotate temporarily on production copr-be:
https://pagure.io/fedora-infrastructure/issue/10391

For now I just did the following, and I hope I can get to this later (or maybe someone else will be faster?):
systemctl stop logrotate.timer


We certainly have a problem with the certificate, but I don't know if it's related

Dec  2 00:05:06 copr-be-temp lighttpd[851]: 2021-12-02 00:05:06: (mod_openssl.c.752) SSL: BIO_read_filename('/etc/lighttpd/copr-be.cloud.fedoraproject.org.intermediate.cert') failed
Dec  2 00:05:06 copr-be-temp lighttpd[851]: 2021-12-02 00:05:06: (mod_openssl.c.2814) SSL: error:0909006C:PEM routines:get_name:no start line /etc/lighttpd/copr-be.cloud.fedoraproject.org.intermediate.cert
Dec  2 00:05:06 copr-be-temp lighttpd[851]: 2021-12-02 00:05:06: (server.c.1282) Initialization of plugins failed. Going down.
Dec  2 00:05:06 copr-be-temp systemd[1]: lighttpd.service: Main process exited, code=exited, status=255/EXCEPTION
Dec  2 00:05:06 copr-be-temp audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=lighttpd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Dec  2 00:05:06 copr-be-temp systemd[1]: lighttpd.service: Failed with result 'exit-code'.

Metadata Update from @praiskup:
- Issue assigned to praiskup

2 years ago

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

2 years ago

Meh this:

Dec 03 16:46:18 copr-be.aws.fedoraproject.org logrotate[616474]: error: Compressing program wrote following message to stderr when compressing log /var/log/lighttpd/access.log-20211203:
Dec 03 16:46:18 copr-be.aws.fedoraproject.org logrotate[616474]: gzip: stdin: file size changed while zipping

Meh, @kdudka, do you think that gzip error could cause the postrotate script
not to be called at all? Because either that is what happened, or the -HUP signal
wasn't delivered properly (I was able to send the signal manually with kill right after
the rotation).

Note to self:
To rotate error.log correctly, it is not enough to just -HUP the cronolog -- we also
have to -HUP lighttpd. :-( Still TODO; I'm turning logrotate off for now (again).

Meh, @kdudka, do you think that gzip error could cause the postrotate script
not to be called at all?

I think so: hasErrors is set by compressLogFile, and logrotate then doesn't run the postrotate script.
But there's the delaycompress option.
I will try that.
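For context, a minimal sketch of what delaycompress changes (paths and the signalled process are placeholders, not the real copr config): the rotated file is only compressed on the next rotation, so gzip never races with a writer that still has the old file open.

/var/log/lighttpd/access.log {
    daily
    missingok
    compress
    # delaycompress postpones gzip to the next rotation, so the file is not
    # being compressed while lighttpd/cronolog may still be writing into it
    delaycompress
    postrotate
        # placeholder reopen signal; the real config signals cronolog / lighttpd
        /usr/bin/pkill -HUP cronolog
    endscript
}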

Commit a10a07f0 relates to this ticket

The error handling is affected by the sharedscripts configuration directive. Using delaycompress may help to avoid the "gzip: stdin: file size changed while zipping" error. Note that all the commit links on this page give me 404.

The error handling is affected by the sharedscripts configuration directive.

I see, but we cannot avoid sharedscripts :-( without it we would "kill" the service twice,
once for each logfile. I am not sure, but I assume this is the reason why lighttpd
failed to start after rotation before (first we systemctl-restarted the service for one log, and while
it was still restarting we sent the -HUP signal for the other log, or vice versa).

Using delaycompress may help to avoid the "gzip: stdin: file size changed while zipping" error.

Yes, it helps; I tested this in the meantime:
https://pagure.io/fedora-infra/ansible/blob/a10a07f0ef88f364e6d909117d47f2713ccf69a6/f/roles/copr/backend/templates/logrotate/lighttpd.j2#_12-27

Note that all the commit links on this page give me 404.

Sorry, this is a bug in Pagure. It automatically inter-links issues among different
projects. I (re-)reported it now:
https://pagure.io/pagure/issue/5256
https://pagure.io/pagure/issue/5085

Yes, it helps; I tested this in the meantime:

That is just a workaround.
Long-term, we should avoid using either logrotate or cronolog, and adjust the hitcounter script so it doesn't rely on the log-rotation mechanism (perhaps some httpd plugin, or something like that).

Hopefully this can be solved at the configuration level. We cannot safely change behavior of the sharedscripts directive. Touching the code triggered ugly regressions in the past. Doing it in a backward-compatible way would also be tricky.

I babysat it today... while the logs were being rotated, lighttpd was seemingly running, though not responding for two minutes or so... and I couldn't resist (I triggered a manual restart).

Metadata Update from @praiskup:
- Issue priority set to: None (was: High)

2 years ago

FTR, it's not necessary to use killall; kill is enough, so I filed this PR:
https://src.fedoraproject.org/rpms/lighttpd/pull-request/3

Fighting with my internet connection today :-( I wasn't quick enough with the PR update, so here is another one:
https://src.fedoraproject.org/rpms/lighttpd/pull-request/4

And for ansible:
https://pagure.io/fedora-infra/ansible/c/9f5ae518054478a99a34c52ee3d519e3c7fefcec?branch=main

Ugh. It happened again; I had to manually restart the lighttpd server.

I don't know precisely what happens upon SIGHUP/SIGUSR1; the only
weird warning there is:
2021-12-08 00:02:41: (server.c.256) warning: clock jumped 34246 secs
And the sub-processes do not start.

Upon full server restart there's (perhaps expected, just noting here):

2021-12-08 00:02:15: (server.c.1551) server started (lighttpd/1.4.61)
2021-12-08 00:02:15: (mod_openssl.c.3247) SSL: 1 error:14209102:SSL routines:tls_early_post_process_client_hello:unsupported protocol
2021-12-08 00:02:15: (mod_openssl.c.3247) SSL: 1 error:14209102:SSL routines:tls_early_post_process_client_hello:unsupported protocol
2021-12-08 00:02:15: (mod_openssl.c.3211) SSL: -1 5 32: Broken pipe
2021-12-08 00:02:15: (mod_openssl.c.3247) SSL: 1 error:14209102:SSL routines:tls_early_post_process_client_hello:unsupported protocol
2021-12-08 00:02:15: (mod_openssl.c.3247) SSL: 1 error:14209102:SSL routines:tls_early_post_process_client_hello:unsupported protocol
2021-12-08 00:02:15: (mod_openssl.c.3211) SSL: -1 5 32: Broken pipe

Note that a full server restart usually worked for me, but I have had to
systemctl kill lighttpd twice so far, once during a playbook run
doing the restart. Simply put, no new connections were accepted
(server down) and yet the server still failed to stop and start again.
And even when the restart eventually works, it often causes a
two-minute outage. Those are the reasons why I simply don't
want to restart daily...

So I realized that we can run error.log through cronolog as well,
so I reworked the logrotate script; now we shouldn't need to reload
or restart lighttpd at all:

https://pagure.io/fedora-infra/ansible/c/83673506b615beb7fb04754121f536dd3e7e9baf?branch=main
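Roughly, the idea is that both logs go through the piped logger, so logrotate no longer needs to signal lighttpd at all; a sketch of the relevant lighttpd.conf lines (paths are assumptions, see the commit above for the real template):

# both the access log and the error log are written through a piped logger,
# so rotating the files does not require reloading or restarting lighttpd
accesslog.filename = "|/usr/sbin/cronolog /var/log/lighttpd/access.log"
server.errorlog    = "|/usr/sbin/cronolog /var/log/lighttpd/error.log"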

Once anyone has more patience than me, we could try to configure
the lighttpd-angel process. See man lighttpd-angel.

FTR, I temporarily changed the logrotate config to run at a more friendly time,
so we can observe the issues. systemctl status logrotate.timer now says
Trigger: Thu 2021-12-09 09:00:00 UTC; 22h left, so:

date --date "Thu 2021-12-09 09:00:00 UTC"
Thu Dec  9 10:00:00 AM CET 2021
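For the record, one way to do that (a sketch only; the exact time is whatever suits the on-call humans) is a drop-in override for logrotate.timer:

# systemctl edit logrotate.timer
[Timer]
# clear the packaged schedule, then run at 09:00 UTC with no random delay
OnCalendar=
OnCalendar=*-*-* 09:00:00 UTC
RandomizedDelaySec=0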

I finally understood the problem.

I was curious why I see this error.log output upon SIGHUP sent only to cronolog (not
to lighttpd itself):

2021-12-08 09:59:03: (server.c.256) warning: clock jumped 4106 secs
2021-12-08 09:59:03: (server.c.262) attempting graceful restart in < ~5 seconds, else hard restart
2021-12-08 11:07:29: (server.c.1012) [note] graceful shutdown started
2021-12-08 11:07:29: (server.c.1012) [note] graceful shutdown started
2021-12-08 11:07:29: (server.c.2057) server stopped by UID = 986 PID = 2056345
2021-12-08 11:07:29: (server.c.1012) [note] graceful shutdown started
2021-12-08 11:07:29: (server.c.2057) server stopped by UID = 986 PID = 2056345
2021-12-08 11:07:29: (server.c.1012) [note] graceful shutdown started
2021-12-08 11:07:29: (server.c.2057) server stopped by UID = 986 PID = 2056345
2021-12-08 11:07:29: (server.c.1012) [note] graceful shutdown started
2021-12-08 11:07:29: (server.c.2057) server stopped by UID = 986 PID = 2056345
2021-12-08 11:07:29: (server.c.1012) [note] graceful shutdown started
2021-12-08 11:07:29: (server.c.2057) server stopped by UID = 986 PID = 2056345
2021-12-08 11:07:29: (server.c.1012) [note] graceful shutdown started
2021-12-08 11:07:29: (server.c.2057) server stopped by UID = 986 PID = 2056345
2021-12-08 11:07:29: (server.c.1551) server started (lighttpd/1.4.61)

And that's because Lighty restarts all its forks any time we terminate any of
its forks (including the cronolog processes). And that's exactly what I was doing
with logrotate:

  • logs were rotated
  • logrotate sent -HUP to cronolog
  • cronolog terminated
  • lighty detected this through wait()
  • the time inconsistency warning appeared
  • lighty attempted (sometimes unsuccessfully) to restart all the child processes

The (racy) code responsible for this behavior in Lighty actually has
a different purpose -- it just monitors that a SIGHUP sent to Lighty was
correctly delivered to (and handled by) the child processes. See the code.

The fact that SIGHUP was never actually delivered to Lighty, only to
cronolog, meant that Lighty's log_epoch_secs was too old (it is actually
initialized in the "main" process, at the initial start of the process tree).

Therefore the "clock jumped 4106 secods" (and other even larger values).
This should be measured between SIGHUP (to Lighty) and the actual child
termination (if it doesn't happen quickly enough, child processes are
force-restarted).

Considering that we cannot easily restart cronolog from Lighttpd, and that
cronolog doesn't handle SIGHUP, it becomes mostly impossible to do this
correctly (lighty + cronolog + logrotate).

Fortunately, implementing an equivalent of cronolog without the
rotation mechanism (which we did not use anyway!) was pretty trivial in
shell. I've tested this in both the staging and production environments with:

systemctl start logrotate.service

Everything is correctly rotated, and Lighty did not notice this at all.

This is the current code used:
https://pagure.io/fedora-infra/ansible/c/013344529aa33e133e95a4fd96703635cab5231f?branch=main
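The gist of that helper, as a minimal sketch (the idea only, not the exact copr-lighty-logger script from the commit): copy stdin to a file and re-open the file on SIGHUP, so logrotate only ever signals this helper, never lighttpd.

#!/bin/bash
# sketch: append stdin to the file given as $1, re-opening it on SIGHUP
logfile=$1
exec >>"$logfile"
trap 'exec >>"$logfile"' HUP
while :; do
    IFS= read -r line
    status=$?
    if [ "$status" -eq 0 ]; then
        printf '%s\n' "$line"
    elif [ "$status" -gt 128 ]; then
        continue    # read interrupted by a signal; the trap re-opened the file
    else
        break       # EOF: lighttpd closed the pipe
    fi
done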

As @praiskup noted in:
https://pagure.io/fedora-infra/ansible/c/19b8f82f5b5552ee59eb539f2187639e65925c5e

stop using max-workers && cronlog

As a lighttpd developer, I would be interested in understanding why server.max-workers with a non-zero value is needed with lighttpd. With modern versions of lighttpd, a single CPU can handle quite a bit of load, and most of that load comes from TLS negotiation for new socket connections. If there is a different bottleneck, I would very much like to know.

Please test without server.max-workers. If you can remove server.max-workers, then the problem described goes away since cronolog would not be needed, and a simple logrotate config could be used.

If server.max-workers turns out to be needed, then why not have cronolog (with --symlink ... --prevlink ...) perform the daily log rotation, and have logrotate run a script to run /usr/bin/copr_log_hitcounter.py on the target of --prevlink and then compress the target?

Until the next lighttpd release which fixes time jump detection when server.max-workers is non-zero, you can disable lighttpd time jump detection by configuring lighttpd:
server.feature-flags += ("server.clock-jump-restart" => 0)

https://pagure.io/fedora-infra/ansible/c/83673506b615beb7fb04754121f536dd3e7e9baf?branch=main

Lighty fails to reload (both on SIGHUP and SIGUSR1 signals). Something simply hangs the processes.

SIGHUP to lighttpd does not restart lighttpd. Due to historic reasons, the signal causes lighttpd to rotate open log files (and does not touch piped loggers)

SIGUSR1 to lighttpd is a graceful restart, and lighttpd waits for existing connections to close. The historical default is to wait forever. As announced in release notes for over a year, starting with the next lighttpd release in early 2022, lighttpd will default to set server.feature-flags += ("server.graceful-shutdown-timeout" => 8) so that the graceful restart will wait for up to 8 seconds for existing requests to finish before (not-so-gracefully) aborting remaining connections. This is configurable in lighttpd.conf.

Fortunately, implementing an equivalent of cronolog without the
rotation mechanism (which we did not use anyway!) was pretty trivial in
shell.
https://pagure.io/fedora-infra/ansible/c/013344529aa33e133e95a4fd96703635cab5231f?branch=main

You commented:

But we would get much higher throughput if implemented in C.

You would get much higher throughput if implemented in Perl or Python (or even awk), too, since shells tend to be inefficient reading character-by-character.

I tested roles/copr/backend/files/copr-lighty-logger from https://pagure.io/fedora-infra/ansible/c/013344529aa33e133e95a4fd96703635cab5231f?branch=main on my laptop and found
$ time cat /tmp/test-file | /tmp/logger /tmp/output-measured
real 1m22.694s
user 0m36.136s
sys 1m19.069s

Rewritten (quickly) as a Perl one-liner, it is over 75x faster:
$ time cat /tmp/test-file | perl -e 'sub reopen { open(\*STDOUT,">>",$ARGV[0]); $|=1; } $SIG{HUP} = reopen; reopen(); print $_ while (<STDIN>);' /tmp/output-measured
real 0m1.076s
user 0m0.545s
sys 0m0.587s

lighttpd is still fast, but your roles/copr/backend/files/copr-lighty-logger causes writing the access log entry to be -- by far -- the biggest bottleneck.

I reviewed https://pagure.io/fedora-infra/ansible/blob/main/f/roles/copr/backend/templates/lighttpd/lighttpd.conf and am convinced that server.max-worker is almost surely unnecessary. The commit from 2018 (three years ago) does not describe the problem that was being solved or how it was measured that the change solved the problem. https://pagure.io/fedora-infra/ansible/c/003bd6271fa809de99c40157f26d035a52df5176?branch=main

If server.max-worker is removed, then lighttpd can be configured to use a basic log file, and logrotate can be used to rotate the log file and send lighttpd SIGHUP to reopen the log file.

Simplification will improve performance and ease maintenance.
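To make that concrete, the simpler setup could look roughly like this (a sketch; file paths, retention, and service name are assumptions, not a tested copr config):

/var/log/lighttpd/*.log {
    daily
    rotate 7
    missingok
    compress
    delaycompress
    sharedscripts
    postrotate
        # with plain (non-piped) log files, SIGHUP makes lighttpd re-open them
        /usr/bin/systemctl kill --signal=HUP lighttpd.service
    endscript
}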

For kicks and giggles, a C version of /tmp/logger runs over 1250x faster than the shell script and "only" about 16x faster than the Perl one-liner.
$ time cat /tmp/test-file | /tmp/logger /tmp/output-measured
real 0m0.065s
user 0m0.005s
sys 0m0.094s

@gstrauss, awesome! Thank you for all the work you have done on this!
I agree with most of the stuff written above, but additional experiments
with --prevlink probably don't make much sense.

Sure, the shell variant is slow, but it seems acceptable (we currently have hundreds of requests per second) until we get rid of it entirely; see #1263.

As a lighttpd developer, I would be interested in understanding why server.max-workers with a non-zero value is needed with lighttpd.

Things have probably changed since then; we started using AWS CloudFront, so
most of the traffic to lighttpd is cached. But I believe the bottleneck was the
directory-listing PHP script, which is btw to be fixed (issue #2009). I need to
fix that and do some benchmarks. @msuchy, am I right, or was there some
other bottleneck? (I wasn't on this team in those days.)

For kicks and giggles, a C version of /tmp/logger runs over 1250x faster than the shell script

While we're at it, would you mind shipping that source/binary as a lighttpd helper? Even if not
used by default with max-workers, it would be a very handy opt-in.

But I believe the bottleneck was the directory listing php script

I don't seem to see that script in use in the master branch. Is it even still used? If it is, then it should be moved to a PHP-FPM backend and managed that way, without lighttpd server.max-worker.

In recent lighttpd releases, mod_dirlisting has been able to do most if not all of (and more than) what the PHP script does. lighttpd mod_dirlisting even has a directory-caching option for use when directories change and (for some reason) periodically generating a (cacheable) index.html is not desirable. The mod_dirlisting caching max-age can be tuned; the default is 15 seconds when mod_dirlisting caching is enabled. On a site generating a heavy amount of directory listings for common directories that do not change every minute, this is a big win.

While we're at it, would you mind shipping that source/binary as a lighttpd helper? Even if not
used by default with max-workers, it would be a very handy opt-in.

I would mind. It does not belong with lighttpd. It is a toy piece of code built as a demo in < 30 minutes. It is a small piece of code which copies input pipe to a file, and reopens the file upon SIGHUP. Adding more code to a backwards process does not improve the process. I think that a whole lot of time has been wasted trying to fix the wrong things, just to preserve the existing log counter, when perhaps the log counter should have changed.

And instead of the log counter, I think that removal of server.max-worker should be a priority as that simplifies many things. Subsequently, most of the recent activity and complexity gets deleted, as I have suggested in https://pagure.io/fedora-infra/ansible/pull-request/903

If I have not said it enough:

If server.max-worker is removed, then lighttpd can be configured to use a basic log file, and logrotate can be used to rotate the log file and send lighttpd SIGHUP to reopen the log file.

Simplification will improve performance and ease maintenance.

Please make a one-line change to comment out server.max-worker = 6 and test that everything still works at an acceptable rate. Once that is confirmed, then many simplifications can occur (https://pagure.io/fedora-infra/ansible/pull-request/903) and the existing log counter from logrotate can be used with less complexity and more stability.

I don't seem to see that script in use in the master branch.

It is being discussed in #1263. It is temporarily disabled.

We'll consider a FastCGI manager (PHP-FPM), yes.

Lighttpd mod_dirlisting in recent lighttpd releases has been able to do most if not all (and more)

We need a way to programmatically add backlinks to the frontend machine,
i.e. from the build results (backend) to the build database entry (frontend).

I would mind. It does not belong with lighttpd

The thing is that you forward people to cronolog, even though they don't need
log rotation. And the tool doesn't fit anywhere else better (given that other
HTTP servers handle this multi-process log syncing just fine themselves).

It is a toy piece of code

Same as the shell toy... But it would be nice to have it somewhere so users don't
have to think about it (btw, I dislike Python's line-by-line pipe-reading support,
I don't speak Perl fluently enough to call it easy reading either, and for C I would have to
manage the builds somewhere). After all, I am sure I could "optimize" the shell script
so it doesn't read at all and just restarts a background cat process to do the
"expensive" read/write... The point is that I have to think about it...

Please make a one-line change to comment out server.max-worker = 6

I consulted @msuchy about this in person now, and it looks like we added
max-workers because of the I/O-bound problems mentioned in the docs.
But it's not a certain thing... especially considering that we moved from
RAID-something storage (over iSCSI) to AWS gp2 storage since the
max-workers change.

I don't think I understand the original problem: how can having more processes
work around the disk-seek-bound storage? Is that max-workers note
still valid with the lighttpd v1.4.61+ that we have?

The old blog post (cited in the docs) mentions an option which, according to
other docs, doesn't look like something that should be turned on manually.

We are serving even larger files (not all of them, but still), mostly RPMs.
Installing an RPM is usually a transaction (foo depends on baz, etc.).
Such a transaction can contain dozens of RPMs, and clients can
parallelize the downloads.

That said... I will experiment with max-workers=0, though it is a config
change that needs to be tested, and I don't know if I want to cause
headaches for our team before Christmas. Also, thanks for the PR; we'll
get to that.

JFTR, in the meantime the logs were rotated fine with the current setup,
so this is not an immediate issue now.

Lighttpd mod_dirlisting in recent lighttpd releases has been able to do most if not all (and more)

We need a way to programmatically add backlinks to the frontend machine,
i.e. from the build results (backend) to the build database entry (frontend).

I took a look through roles/copr/backend/templates/lighttpd/dir-generator.php.j2 and, yes, lighttpd can do everything that script does, and much, much more quickly. I pushed a small commit with a few lines of javascript to https://pagure.io/fedora-infra/ansible/pull-request/903 to mock up generation of the backlink to the frontend machines, but have not tested it in your environment. To think that dir-generator.php CGI PHP script could have been replaced with 20 lines of javascript years ago...

I consulted @msuchy about this in person now, and it looks like we added
max-workers because of the I/O-bound problems mentioned in the docs.
But it's not a certain thing... especially considering that we moved from
RAID-something storage (over iSCSI) to AWS gp2 storage since the
max-workers change.

On a RAID storage, presumably on a beefy system with sufficient memory, I doubt that lighttpd server.max-worker made much of a difference. Was the impact measured at the time?

Please make it a priority (before #1263 you referenced above) to measure the performance with and without server.max-worker. Commenting out server.max-worker is a safe change and is extremely unlikely to break anything. If something went really wrong during testing, responses might slow down, but in that case server.max-worker could easily be re-enabled. However, I hope you find as I predict, that there will be little to no change in performance. Instead, I think that there is other low-hanging fruit to improve performance, such as adding Cache-Control headers to responses which allow caching, something not currently configured in the copr lighttpd.conf. I have also suggested short-lived caching of directory listings in https://pagure.io/fedora-infra/ansible/pull-request/903
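For illustration, the Cache-Control suggestion could be as small as this (a sketch; the URL pattern and max-age are made-up values, and mod_setenv has to be loaded):

server.modules += ( "mod_setenv" )

# let CloudFront and browsers cache the (immutable) build results for a while
$HTTP["url"] =~ "\.rpm$" {
    setenv.add-response-header = ( "Cache-Control" => "public, max-age=3600" )
}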

I don't think I understand the original problem: how can having more processes
work around the disk-seek-bound storage? Is that max-workers note
still valid with the lighttpd v1.4.61+ that we have?

To try to briefly answer your question with an example: iff a request for a random RPM results in an I/O bound access which blocks (in the kernel) the single threaded lighttpd process, and kernel read-ahead is not sufficient to minimize the blocking, then lighttpd is unable to serve other simpler requests -- such as for a .css or index.html which might already be in memory -- while the kernel blocks lighttpd on the I/O. On the other hand, multiple requests for random RPMs may all pause momentarily if the kernel filesystem cache is cold for those RPMs, so server.max-worker will not help much in that case.


I would mind. It does not belong with lighttpd

The thing is that you forward people to cronolog, even though they don't need
log rotation. And the tool doesn't fit anywhere else better (given that other
HTTP servers handle this multi-process log syncing just fine themselves).

lighttpd supports logging to syslog, which handles multi-process log syncing just fine. lighttpd supports logging to a piped logger, which handles multi-process log syncing just fine. cronolog is just one well-known suggestion. As you know, cronolog provides the feature of log rotation, but you do not have to use it, so I do not understand what point you are trying to make.

Separately, you found a bug in lighttpd when the piped logger is killed and server.max-workers is non-zero. (https://redmine.lighttpd.net/issues/3123) Thank you for reporting it. The bug has been fixed in lighttpd development. That said, I have stated (many times) here that you probably do not need to be using server.max-worker, and if you did not use server.max-worker then you would not be affected by the bug in lighttpd, which you were tripping over frequently only because you are intentionally killing cronolog, something most people using cronolog do not do.

After all, I am sure I could "optimize" the shell script
so it doesn't read at all and just restarts a background cat process to do the
"expensive" read/write... The point is that I have to think about it...

If you can confirm that you do not need server.max-worker, then you also do not need cronolog or your inefficient shell script, and the entire problem in this ticket just goes away, replaced by a simpler solution. That's a lot less to think about...

I do not understand what point you are trying to make.

It's simply feedback, nothing more: if anyone has a real need to use the server.max-worker option, they cannot use logrotate. If you don't mind that, it is completely OK - no objection (I'll maybe blog a bit about this case, and about a possibly working helper, later).

We'll try to confirm (a) that max-worker can be dropped, (b) whether we can move to FPM, and (c) check the directory-listing module. Thanks a lot for the suggestions!

We'll try to confirm (a) that max-worker can be dropped, (b) whether we can move to FPM, and (c) check the directory-listing module.

Thanks. I hope you'll find that the commit in https://pagure.io/fedora-infra/ansible/pull-request/903 is much more efficient than running the custom dir-generator.php under PHP-FPM.

It's simply feedback, nothing more: if anyone has a real need to use the server.max-worker option, they cannot use logrotate.

I think that you are mischaracterizing that issue. Above, I tried to describe your specific use of cronolog (not using cronolog for log rotation), and your abuse of killing cronolog to "force-reopen" the log file after logrotate, while also using lighttpd server.max-worker with a non-zero value. The bug you found in lighttpd in that (presumably) rare use case is already fixed in lighttpd development: https://redmine.lighttpd.net/issues/3123

Thank you for changing the docs!

I think that you are mischaracterizing that issue. Above, I tried to
describe your specific use of cronolog (not using cronolog for log
rotation) ...

Perhaps I'm just misled by the other docs, and biased by the fact that
I had to dive in deeper than I wished. :-)

Check the accesslog docs:

If you have multiple workers and want to rotate logs, use a piped
logger, e.g. accesslog.filename = "|/usr/sbin/cronolog ...".

I think that, at the least, this isn't ideal; something like this would be better:

If you have multiple workers, use a piped logger. Cronolog might be a
good option if you also want to automatically rotate the logs.

your abuse of killing cronolog to "force-reopen" the log file after
logrotate, while also using lighttpd server.max-worker with a non-zero
value. The bug you found in lighttpd in that (presumably) rare use case
is already fixed in lighttpd development:
https://redmine.lighttpd.net/issues/3123

I agree it was a random attempt (aka abuse). The feedback I'm giving:

What am I supposed to do if (a) logrotate is needed, and (b) more workers
are needed as well? Is this such an extreme example? Even with the fixed
code, I see no option other than (a) restarting lighttpd from logrotate, or
(b) implementing my own piped logger.

Perhaps this would be helpful:

When a piped logger is used, the use of other external log-rotation mechanisms (e.g. logrotate) is discouraged.

Unless we find some better wording? I really understand your attempts to
fight against the 'max-workers' option, but still the docs admit that people
are using that option.

What am I supposed to do if (a) logrotate is needed, and (b) more workers
are needed as well? Is this such an extreme example? Even with the fixed
code, I see no option other than (a) restarting lighttpd from logrotate, or
(b) implementing my own piped logger.

As noted in https://redmine.lighttpd.net/issues/3123:

  • The bug you identified is fixed (lighttpd 1.4.64 and later)
  • A workaround is immediately available for earlier versions: disable time jump detection
    (which is more important for embedded systems without persistent and stable clocks)
    server.feature-flags += ("server.clock-jump-restart" => 0)
  • Restart lighttpd with immediate graceful restart
    server.feature-flags += ( "server.graceful-restart-bg" => "enable" )
    server.systemd-socket-activation = "enable"

With any one of the three solutions above, your "abuse" of killing cronolog
(or restarting lighttpd to kill cronolog) in your logrotate setup would operate as
you expected and would not require you to implement your own piped logger.
Your comment suggests that you are unaware of other piped loggers besides cronolog.
Two such examples are DJB multilog and Apache rotatelogs. I have no doubt there are others.

I really understand your attempts to fight against the 'max-workers' option,
but still the docs admit that people are using that option.

The wording is there because so many people, including yourself,
probably should not be touching some of the tunable knobs in lighttpd,
particularly server.max-worker. This is similar to how someone should
not modify kernel tunables unless they have a reason to do so,
understand the limitations and potential impact of the change,
and know how to test the change. That is generally good advice.

examples are DJB multilog and Apache rotatelogs

What is needed is just a plain pipe... not an alternative to cronolog, nor a
log-maintenance mechanism. But I got the point that implementing/documenting
logging for max-worker isn't worth it; that option should be
dropped. And I get that "cat" would help us here, we just need to wait
for the fixed lighttpd version (but note: killing it from logrotate is still
labeled as "abuse").

including yourself,

This is not personal (I haven't touched the internals so far). I just don't change
configuration without a reason... if the system works.

server.clock-jump-restart
server.graceful-restart-bg
server.systemd-socket-activation = "enable"

Just note the irony: on one hand you claim we shouldn't touch foo,
but on the other we should touch baz. /me is moving on

OK, the original problem is fixed. Thanks for the discussion; we will slowly
take a look at the enhancements proposed in #2011.

Metadata Update from @praiskup:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

FTR: svlogd is a piped logger which reopens log files when sent SIGHUP
