#1311 Disable syscall auditing by default
Closed None Opened 9 years ago by amluto.

= phenomenon =

On a default Fedora install, system call auditing is enabled by default. This is somewhat useful, but it also has large costs.

= background analysis =

Linux has a fast path for system calls. It's been heavily tuned over the years, and it's very fast. The kernel can transparently enable and disable it per process. Seccomp and ptrace force the slow path to be invoked for affected processes, but system call auditing forces it to be invoked for pretty much every process.

On my laptop, syscall(SYS_getpid) takes 241.65 ns if syscall auditing is on (even with no rules) and and 70-ish ns if syscall auditing is off.

On my desktop, getpid takes ~65.5 ns on a default Fedora configuration (I think -- this system has been through a few upgrades); if I disable syscall auditing, getpid goes down to 40.37 ns.

For heavy network users, this may be a large cost.

The in-kernel syscall auditing code is also rather large and crufty. For example, CVE-2014-3917 has apparently been lurking there for many years. Admittedly, a default Fedora configuration is unaffected by CVE-2014-3917, but there could be other problems in there.

As far as I know, syscall auditing users fall into two camps. There are people who need it for Common Criteria or other compliance reasons, and there are people who use it for selinux troubleshooting. I would argue that it has very limited value for selinux troubleshooting: AVC denials are logged regardless of whether syscall auditing is on by default.

Note that many architectures do not support system call auditing, so there are Fedora users who cannot benefit from it.

= implementation recommendation =

Here are four possible changes that could be made:

  1. Improve the kernel to reduce syscall auditing overhead. I tried this and gave up. The code is a mess, and the assumptions it makes about audit context setup are hard to disentangle.

  2. Turn off CONFIG_AUDITSYSCALL on Fedora kernels. This is the heavyweight solution and will make Fedora kernels unusable for people who need auditing for Common Criteria and such.

  3. Remove auditd from the default install. This will break setroubleshootd and is probably a bad idea.

  4. Add something like this to {{{/etc/audit/rules.d/audit.rules}}}:
    {{{

This suppresses syscall auditing for all tasks started

with this rule in effect. Remove it if you need syscall

auditing.

-a task,never
}}}
This should eliminate most of the overhead and attack surface, and I suspect that people who actually need syscall auditing are modifying {{{audit.rules}}} anyway. It will have some impact on the quality of selinux failure logs, but I don't expect this to be a big problem.


Can you give some application-level benchmarks demonstrating the real-world impact?

Sure.

I ran fio using this configuration:
{{{
[reads]
norandommap
time_based
runtime=10
ioengine=psync
size=256k
bs=8
filename=/tmp/testfile
rw=read
}}}

In the default configuration, perf says:
{{{
53.45% fio [unknown] [.] 0x0000000000409d32
3.17% fio [kernel.kallsyms] [k] shmem_file_aio_read
3.02% fio [kernel.kallsyms] [k] page_waitqueue
2.27% fio [kernel.kallsyms] [k] vfs_read
2.25% fio [kernel.kallsyms] [k] __audit_syscall_exit
2.04% fio [kernel.kallsyms] [k] shmem_getpage_gfp
1.87% fio [kernel.kallsyms] [k] find_get_entry
1.87% fio [kernel.kallsyms] [k] __radix_tree_lookup
1.86% fio [kernel.kallsyms] [k] fsnotify
1.84% fio [kernel.kallsyms] [k] __fget_light
1.83% fio [kernel.kallsyms] [k] copy_page_to_iter
1.80% fio [kernel.kallsyms] [k] do_sync_read
1.76% fio [kernel.kallsyms] [k] system_call
1.73% fio [kernel.kallsyms] [k] system_call_after_swapgs
1.65% fio [kernel.kallsyms] [k] unlock_page
1.59% fio [kernel.kallsyms] [k] find_lock_entry
1.49% fio [kernel.kallsyms] [k] put_page
1.47% fio [kernel.kallsyms] [k] copy_user_generic_string
1.29% fio [kernel.kallsyms] [k] __audit_syscall_entry
1.16% fio [kernel.kallsyms] [k] sys_pread64
}}}

In the default configuration, I get 2094k IOPS, and after {{{auditctl -a task,never}}}, I get 2412k IOPS. There's a fair amount of noise, but rerunning it a couple times gets similar results. Oddly, the speedup is consistently much more than I'd expect from just the {{{__audit_syscall}}} stuff, which suggests that there's more overhead than can be accounted for by just time spent in that function. The assembly slow path may account for some of the extra overhead.

I should add an partial option !#5: change audit.service so that it does not enable kernel auditing by default. I don't know how to do this cleanly, and the audit code has the odd property that every task that starts after audit has ''ever'' been enabled will have auditing overhead. So enabling and re-disabling auditing doesn't actually work. The {{{-a task,never}}} thing is a separate control that doesn't have this oddity.

Adding audit maintainer for comment here.

I am also adding SELinux maintainers as they should have a say whether affecting the SELinux troubleshoot reports this way is significant or not.

I don't want to speak for dwalsh or mgrepl as they are the SELinux policy gurus here, but as someone who is often just passed an AVC/SYSCALL audit record with little in the way of a reproducer and asked to "fix it!", having the extra information in the syscall record can be very helpful. While there are short term performance gains to be had by turning off syscall auditing, losing the extra information in the audit record will likely increase the number of problems/bugzillas that I'm unable to resolve.

I realize the audit code is slow and not very pretty, but I'd much rather we work on improving the code rather than disabling it.

Can you give an example where the syscall record is helpful? I haven't been able to come up with one myself.

Note that, in theory, the kernel could try to figure out whether it's an a syscall when a selinux denial happens and print out the syscall number and args. I think the only piece missing is a way to determine when we're in a syscall in the first place. Adding that might be feasible.

Off the top of my head, the example that springs to mind are a lot of the socket/networking syscalls (excluding 32-bit x86). Being able to look at the syscall arguments can be very helpful in determining what the application is doing and what code paths are being exercised in the kernel.

If we can figure out a way to capture the information from a syscall audit record without having to do full syscall auditing, that is fine with me. When we hit a SELinux denial, having to do a little extra work isn't a problem as we're no longer in the "common case" as far as performance is concerned.

If you know you're in a syscall, then, at least on x86, you can do this:

{{{
unsigned long args[6];
int nr = syscall_get_nr(current, current_pt_regs());
syscall_get_arguments(current, current_pt_regs(), 0, 6, args);
}}}

Lemme see if I can get a clean way to do this into -next.

The reporting, search, analysis tools are sensitive to any changes in an event's fields. It would be a lot of work in user space side to fix up analysis if AVC's did not have syscall records.

My thoughts on this is that some work should be done to see if we can pinpoint where performance bottleneck is coming from. Back on RHEL5 audit was pretty quick. I think the performance team looked at it around 6.1's time frame also. But there has been a whole lot of work upstream since then that maybe something got changed for expedience without noticing the performance change? Just guessing. But I think its better to fix that problem than turning off audit.

The performance issue is easy to understand: the audit code hooks syscall entry and exit, and that hook forces some slow path code to run. And I think the fact that the tools expect separate syscall records is solidly in the category of "fix the tools".

I sent some [http://lkml.kernel.org/g/%3Ccover.1401486790.git.luto@amacapital.net%3E patches] that could provide a nice alternative.

Meta-question: Is this really the right place to discuss kernel patches, and do essentially ''research'' of performance optimization?


I’m generally fine with requiring users that want to configure non-default audit rules to flip one more option to enable full audit collection, with two reservations:
1. The user-space audit tools should make it really difficult for the user to add rules without enabling the necessary infrastructure.
1. The proposed mechanism to disable/enable audit collection needs to 100% work for the audit case, e.g. it must not lead to having auditing disabled for early processes with no way to reenable (this is hypothetical, proposal no. 4 is AFAIK fine in this regard).

For completeness, note that the SYSCALL records are output also for non-SELinux cases, though all I have in my log is AUDIT_NETFILTER_CFG, which is IMHO not essential, at least for the default configuration.


If talking about SELinux, what about other record types, in particular AUDIT_PATH records? These can’t be just extracted in the post-failure path because rereading the user-space memory that contained the pathname won’t necessarily give you the same contents that were actually used for the pathname lookup.

I guess it could be argued that this race condition is not relevant for debugging-intended SELinux messages; I don’t know, at least I’d say that such possibly fake information should never end up in the actual audit log. Anyone care to argue for ignoring the race in more detail?

Separately, how much are AUDIT_PATH, and other non-AUDIT_SYSCALL, records useful for SELinux debugging?

Replying to [comment:14 mitr]:

Meta-question: Is this really the right place to discuss kernel patches, and do essentially ''research'' of performance optimization?

Andy filed a kernel bug to turn off the kernel config option entirely to start with. That's not appropriate given that it would impact SELinux, audit, and the default Fedora installs. So I told him to bring up the issues with FESCo first.

The actual performance improvement suggests and changes can be discussed elsewhere, but the overall "should we disable/limit syscall auditing" is certainly a FESCo topic.

Replying to [comment:14 mitr]:

I’m generally fine with requiring users that want to configure non-default audit rules to flip one more option to enable full audit collection, with two reservations:
1. The user-space audit tools should make it really difficult for the user to add rules without enabling the necessary infrastructure.

Proposal number 4 does this: the line you'd have to remove/comment out is in the same file that you'd edit to add rules.

  1. The proposed mechanism to disable/enable audit collection needs to 100% work for the audit case, e.g. it must not lead to having auditing disabled for early processes with no way to reenable (this is hypothetical, proposal no. 4 is AFAIK fine in this regard).

Proposal number 4 isn't quite there. If you remove '-a task,never' and then add some new rules, pre-existing processes might not respect the new rules until a restart. I would argue that this is a minor issue.


If talking about SELinux, what about other record types, in particular AUDIT_PATH records? These can’t be just extracted in the post-failure path because rereading the user-space memory that contained the pathname won’t necessarily give you the same contents that were actually used for the pathname lookup.

From my minimal understanding of exactly how the "denied" records work, they have their own mechanism for this. Here's an old "denied" record from my logs:

{{{
type=AVC msg=audit(1398474953.696:389): avc: denied { unlink } for pid=681 comm="systemd-readahe" name=".readahead" dev="dm-1" ino=3305 scontext=system_u:system_r:readahead_t:s0 tcontext=system_u:object_r:file_t:s0 tclass=file
}}}

There's a name, inode, and full context information. I don't think this has anything to do with the AUDITSYSCALL stuff, but I could be wrong.

Per http://lwn.net/Articles/600933/ : turning off CONFIG_AUDITSYSCALL won’t fly because it would break loginuid and session tracking.

(I’d really appreciate more analysis from people who understand the details of the audit code and the ecosystem. Yes, I can now go take a look whether the proposed -a task, never interacts with session tracking in any way, but I’m fairly sure nobody wants FESCo decisions to depend on my cursory understanding of all of this code, when it’s clear by this aspect that I don’t know what questions to ask.)

Hi. I wanted to mention that before we do anything, I'd like to get all the facts in place. I don't have time to dig into this issue this week, prior commitments. And Eric Paris needs to look into this also and I don't think he'll be around this week either. But once upon a time, a check was placed in syscall entry to see if there were any rules loaded and bail out if not. This kept the audit system fast for anyone not needing audit. Somewhere along the way, I think the entry logic got changed (perhaps for expedience of coding) and its now heavier. This may have been for SE Linux's purposes - I don't remember. So, what I'd like to do is performance check rhel5 vs rhel6 vs rhel7 to when things changed. I think the performance team checked auditing on rhel 6.1. Previously, the cost of audit was a couple 'if' statements when no rules were loaded. So, if it regressed between 6 & 7, we have a starting point to figure this out.

I'd really prefer getting an understanding of what happened and why and fix the code if possible without any changes to audit record formats. The suggested rule will remove the TIF_SYSCALL_AUDIT flag from the process rendering it inauditable until system reboot. This means you might not be able to reproduce a selinux denial and get usable info.

Sounds good. I'll tag this for the FESCo meeting but not put it on this week's agenda.

This would mean we would loose some information that can be used by
setroubleshoot to help diagnose an issue. But not sure how often this
helps. setroubleshoot attempts to recreate the full path of the target
object using the inode, name and device and slocate. I have not checked
recently if this still works. Not having the full path of the target has always
been a pain, but we have had this problem, unless people turn on full auditing.

From Steve question above, is full auditing on for everyone? I run with it on my machine for the Audit PATH.

type=PATH msg=audit(1402271515.893:480): item=0 name="/dev/urandom" inode=26969 dev=00:28 mode=020666 ouid=0 ogid=0 rdev=01:09 obj=system_u:object_r:tmpfs_t:s0 nametype=NORMAL

Knowing that the actually syscall was execv helps us to diagnose that the avc was a leaked file descriptor.

Some times I look to see what the syscall was, but not often.

I don't believe it would be devastating to SELinux diagnosis, if this was turned off by default. Iff it is a big performance improvement. But I agree with Steve, that we should see if this is a regression in code or if the problem has always been there.

I won't be at today's FESCo meeting where this will be discussed, but in general my inclination is to defer to the experts already involved in the conversation here. (That is, if someone is willing to dig into whether there is a performance regression which could be fixed, that would be nice, but in general there seem to be few downsides to the proposal. I ''would'' love to see more numbers.)

v3.10: 71ns vs 123ns

{{{
3.1.0
Without audit: 10000000 loops in 0.70769s = 70.77 nsec / loop
AUDIT_STATUS: enabled=1 flag=1 pid=0 rate_limit=0 backlog_limit=64 lost=0 backlog=0
With audit: 10000000 loops in 1.22076s = 122.08 nsec / loop
Done
}}}

v2.6.28 doesn't even build -- my copy of make doesn't like the Makefile.

v2.6.30 doesn't build either: it doesn't like my binutils.

I'm set up now to very easily test a bzImage, though.

I can do one better. Download my initramfs [1], and then do:

{{{
qemu-kvm -kernel BZIMAGE -initrd irfs --nographic -append console=ttyS0
}}}

It's a canned test.

[1] http://web.mit.edu/luto/www/fedora/irfs This link is temporary, and I will provide source on request, although you probably don't actually want the source, since it's a hacked virtme-mkinitramfs output from Feodra 20, which means it's a random assortment of scripts and binaries.

i'm going to abstain from voting on this, this week, since i'll miss the meeting. but frankly, i'm unconvinced unless it shows up as a big loss on macro-benchmarks. i know linus has chided us in the past for enabling it, but he did some quite substantial work to speed it up several years back for us.

As another data point, I got a bug report yesterday from Gentoo. It looks like the 32-bit sysenter path is quite broken wrt syscall auditing and can oops.

I really think that no one except maybe some black-hats have looked at the syscall audit code seriously in a long time, if ever.

Any further news here? I know last week we were waiting for some more benchmarking...

It will probably be a couple weeks before we have any data. Also, one bad effect of disabling syscall auditing is that to get syscall or watch events, the user will have to reboot their system to get it to work. This is non-obvious.

They have to reboot to get audit contexts allocated for existing tasks? Maybe that is a reason to do it in kernel space instead of userspace... But for now, this request can certainly be done as a one line change in the default config file...

There was a longish thread involving me and Oleg Nesterov in which I proposed a way to dynamically turn on auditing for existing tasks, but my code was wrong and fixing it is nontrivial. If anyone else wants to pick it up, be my guest.

From the June 25 meeting:

AGREED: turn off syscall auditing for all processes by default. we will reevaluate when we get info from the perf team (6-0-0)

I don't know if anyone here has noticed, but systemd is enabling the audit system. If they are allowed to enable it, why can't I? What was the point of this exercise if they don't follow this recommendation?

Replying to [comment:34 sgrubb]:

I don't know if anyone here has noticed, but systemd is enabling the audit system. If they are allowed to enable it, why can't I? What was the point of this exercise if they don't follow this recommendation?

I'd be treating this as a bug; have you filed this in BZ? Please link it if so.

Much of the discussion is on bz #1227379, to which I attached a patch to revert the audit enabling code in systemd. I suppose a new bz could be opened just for this.

Unless I'm missing something, the issue is that systemd enables auditing, which isn't the same thing as syscall auditing, and this FESCo ticket is about syscall auditing.

(FWIW, I agree that having journalctl enable auditing by default is backwards, but I think that's a separate issue.)

The default rule set that auditd loads only has one difference from what journald is doing: it increased the backlog buffer to 320 while journald doesn't.

https://fedorahosted.org/audit/browser/trunk/init.d/audit.rules

Otherwise they are identical.

. mitr pipes up

This ticket was about disabling ''syscall'' auditing, and was resolved by http://pkgs.fedoraproject.org/cgit/audit.git/tree/never-audit.patch . Pointing at upstream unpatched audit.rules is therefore at best irrelevant and at worst actively misleading.

If you want to argue that systemd should not be forcefully enabling auditing, at all, or at least in such an unexpected place, prima facie I might agree. And if you want to ask FESCo to intervene, by all means do so — but please clearly file a separate ticket. Not everyone here has the long history and full domain knowledge to quickly tell the difference.

Lets not conflate the syscall auditing issue with the systemd enabling issue; they are completely independent and in AFAICS there is no way deciding on one of them suggests or supports deciding in any way on the other one.

(To be clear, I am not on FESCo and have no authority in this discussion.)

. mitr disappears again

So the fix in this ticket may be incomplete.

If systemd enabled audit and auditd starts (with this fix) only process started after systemd enabled audit and before it started auditd will get the 'go slow' flag. Auditd is launched VERY early, so this is not a big concern.

If systemd enabled audit but auditd never runs all processes will get the syscall audit flag and will thus 'go slow'. Before systemd got involved removing/disabling auditd was enough to gain back the performance hit.

I might suggest we ask auditd to automatically load the -a never rule. Auditd can flush that rule when it starts.

Sorry, I have missed the audit-not-installed case.

Do you mean “ask the kernel” to automatically load the rule? If we are concerned about the situation in which auditd is not installed at all, there are not that many places which can put the default rule in: basically either kernel, or systemd, and they would need to coordinate with auditd for the rule removal.

Ugh. journald not touching the audit enable flag ''would'' be so much easier.

This is just broken.

  • auditd knows how to load rules and auditd knows how to set the global enable flag if configured to do so.

  • systemd does not know how to load rules and apparently doesn't offer any configuration for the global flag.

So why exactly is Fedora using systemd's audit support in the first place? It appears to conflict with auditd and it appears to be incomplete. Can we either patch out systemd's audit support or ask the systemd people to implement complete, useful (i.e. Common Criteria compliant, etc) audit support?

I've poked the bug to see if we can get the immediate solution implemented in systemd to disable syscall auditing. We'll start there and then address some of the stuff Andy brings up in comment 42 later.

Login to comment on this ticket.

Metadata