Note: this proposal has serious problems, see Problems.
Additionally, signal handlers can be installed either with or without SA_RESTART, so one can simply register the right kind of handler for the desired effect, which is how this problem should be fixed.
It is still open whether msgsnd breaks on EINTR even when the handler was installed with SA_RESTART.
select and poll might.
EINTR is used by UNIX syscalls in order to solve the PC loser-ing problem.
UNIX generally tries to make everything blocking in order to maintain the illusion that each process is the only one using the machine.
However, suppose that a process is reading from a source and some outside process wants to reconfigure that source.
In that case, some way to interrupt system calls (on purpose) is required.
Also, when an (unsolicited) signal arrives, it's easier for the kernel not to have to remember the state of running syscalls while the signal handler is running.
In both cases (a signal that should cause an error and a signal that should not), errno == EINTR is used, and there are some other status codes that also mean "interrupted" without setting errno to EINTR.
So, on being interrupted, one now has to decide whether to break (report the error) or to retry.
The whole point of blocking is that one doesn't want to make this decision at each call site.
It is thus preferable to make it globally (per thread) instead.
Provide a global intr handler callback and userdata. The callback should have the signature
int libc_handle_intr(void* userdata);
The userdata should be
__thread void* libc_intr_handler_userdata;
And the callback variable thus should be
__thread int (*libc_intr_handler)(void* userdata);
The returned value is (-1) if the intr should still be reported to the caller, or 0 if the intr was handled in some way.
After returning (-1) once, the callback is supposed to keep returning (-1) until the user manually resets its state.
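As an illustration of this contract, here is a minimal sketch of a handler that retries unsolicited EINTR but reports a cancellation request, and then keeps reporting it until it is reset. The two variables are declared as proposed (but static, since no libc provides them); struct cancel_state, cancel_handle_intr and cancel_reset are hypothetical names, and it is assumed that some signal handler sets cancel_requested.

```c
#include <assert.h>
#include <signal.h>

/* Local stand-ins for the proposed variables (no libc provides them). */
static __thread void* libc_intr_handler_userdata;
static __thread int (*libc_intr_handler)(void* userdata);

/* Hypothetical per-thread state for an example handler. */
struct cancel_state {
	volatile sig_atomic_t cancel_requested; /* assumed to be set by e.g. a SIGINT handler */
	int reported;                           /* sticky latch, reset only by cancel_reset() */
};

/* Returns (-1) to report the interruption to the caller, 0 to make libc retry. */
static int cancel_handle_intr(void* userdata)
{
	struct cancel_state* state = userdata;
	if(state->reported || state->cancel_requested) {
		state->reported = 1; /* keep returning (-1) until cancel_reset() */
		return (-1);
	}
	return 0; /* unsolicited EINTR: retry */
}

/* The manual reset required by the contract. */
static void cancel_reset(struct cancel_state* state)
{
	state->cancel_requested = 0;
	state->reported = 0;
}
```

Once the program has acted on the cancellation, it calls cancel_reset(); until then every interruptible call in this thread keeps failing with EINTR.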
On EINTR (and other detectable circumstances), the standard library should call this libc_intr_handler callback in a loop, for example:
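int result;
/* Illustrates the loop shape only: on Linux, close must NOT be retried on EINTR (see the list below). */
while((result = interruptible_close(fd)) == -1 && errno == EINTR) {
	if(call_libc_intr_handler() == -1) {
		errno = EINTR;
		return (-1);
	}
	/* else: the interruption was handled, try again */
}
return result; /* 0, or (-1) with errno set for errors other than EINTR */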
For read, the situation is more complicated:
/* Note: could be optimized for bufsize == 1, which is probably never going to be used. */
assert(bufsize > 0);
assert(bufsize <= SSIZE_MAX);
char *p = buf;
ssize_t accucount = 0;
ssize_t count = 0;
/* Note: could unroll one of the calls in anticipation of EINTR not happening. */
while(bufsize > 0 && ((count = interruptible_read(fd, p, bufsize)) >= 0 || errno == EINTR)) {
	/* Check the sign before casting: (size_t) -1 would compare as SIZE_MAX. */
	assert(count < 0 || (size_t) count <= bufsize);
	if(count < 0 || (size_t) count < bufsize) {
		/* Maybe EOF, maybe interrupted; especially not count == bufsize. */
		if(count == 0) /* EOF */
			return accucount;
		/* count > 0 || (count < 0 && errno == EINTR),
		   though errno may be changed by the handler. */
		switch(call_libc_intr_handler()) {
		case (-1):
			accucount += (count > 0) ? count : 0;
			if(accucount == 0) {
				errno = EINTR;
				return (-1);
			}
			return accucount;
		default:
			/* bufsize > 0 still holds, so interruptible_read() will be called again. */
			break;
		}
		if(count < 0)
			count = 0;
	}
	p += count;
	bufsize -= (size_t) count; /* safe: count >= 0 here, and the read syscall ensures count <= bufsize */
	accucount += count; /* cannot overflow: the total never exceeds the initial bufsize <= SSIZE_MAX */
}
/* Left the loop because of an error other than EINTR, or because the buffer is full. */
if(count < 0 && accucount == 0)
	return (-1); /* errno as set by interruptible_read() */
return accucount;
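To check the read loop above against simulated interruptions, it can be wrapped into a self-contained function. This is a sketch under assumptions: interruptible_read() is faked by a hypothetical script replay (a step > 0 delivers up to that many bytes, 0 is EOF, -1 is EINTR), the proposed variables are local stand-ins, and intr_read, retry_always and report_always are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <sys/types.h>

/* Local stand-ins for the proposed variables (no libc provides them). */
static __thread void* libc_intr_handler_userdata;
static __thread int (*libc_intr_handler)(void* userdata);

static int call_libc_intr_handler(void)
{
	return (*libc_intr_handler)(libc_intr_handler_userdata);
}

/* Hypothetical fake of the raw syscall, replaying a script:
   step > 0 delivers up to that many bytes, 0 is EOF, -1 is EINTR. */
static const int* script;
static size_t script_pos;

static ssize_t interruptible_read(int fd, void* buf, size_t bufsize)
{
	(void) fd;
	int step = script[script_pos++];
	if(step < 0) {
		errno = EINTR;
		return (-1);
	}
	size_t n = ((size_t) step < bufsize) ? (size_t) step : bufsize;
	memset(buf, 'x', n);
	return (ssize_t) n;
}

/* The read loop, with the signedness of count handled explicitly. */
static ssize_t intr_read(int fd, void* buf, size_t bufsize)
{
	assert(bufsize > 0 && bufsize <= (size_t) SSIZE_MAX);
	char* p = buf;
	ssize_t accucount = 0;
	ssize_t count = 0;
	while(bufsize > 0 && ((count = interruptible_read(fd, p, bufsize)) >= 0 || errno == EINTR)) {
		assert(count < 0 || (size_t) count <= bufsize);
		if(count < 0 || (size_t) count < bufsize) {
			if(count == 0) /* EOF */
				return accucount;
			if(call_libc_intr_handler() == (-1)) {
				accucount += (count > 0) ? count : 0;
				if(accucount == 0) {
					errno = EINTR;
					return (-1);
				}
				return accucount;
			}
			if(count < 0)
				count = 0; /* handled EINTR: nothing to advance, just retry */
		}
		p += count;
		bufsize -= (size_t) count;
		accucount += count;
	}
	if(count < 0 && accucount == 0)
		return (-1); /* an error other than EINTR */
	return accucount;
}

/* The two handlers from the proposal, under hypothetical names. */
static int retry_always(void* userdata) { (void) userdata; return 0; }
static int report_always(void* userdata) { (void) userdata; return (-1); }
```

With the script 3, EINTR, 2, EOF and a 10-byte buffer, retry_always makes intr_read() return 5, while report_always makes it return the 3 bytes it already has.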
Note that a read wrapper that doesn't advance through the buffer wouldn't work, since the caller could not tell whether a short read was caused by an (intentional) EINTR or not:
while((count = interruptible_read(fd, buf, bufsize)) == (-1) && errno == EINTR) {
if(call_libc_intr_handler() == (-1)) {
errno = EINTR;
return (-1);
}
}
if(count >= 0 && count < bufsize)
if(call_libc_intr_handler() == 0)
WHAT NOW?!?!
return count;
For write:
assert(bufsize > 0);
assert(bufsize <= SSIZE_MAX);
const char *p = buf;
ssize_t accucount = 0;
ssize_t count = 0;
while(bufsize > 0 && ((count = interruptible_write(fd, p, bufsize)) >= 0 || errno == EINTR)) {
	/* Check the sign before casting: (size_t) -1 would compare as SIZE_MAX. */
	assert(count < 0 || (size_t) count <= bufsize);
	if(count < 0 || (size_t) count < bufsize) {
		/* Maybe interrupted; especially not count == bufsize.
		   count == 0 is not treated specially: it would not ensure global progress,
		   so I doubt that the syscall returns that. */
		/* count > 0 || (count < 0 && errno == EINTR),
		   though errno may be changed by the handler. */
		switch(call_libc_intr_handler()) {
		case (-1):
			accucount += (count > 0) ? count : 0;
			if(accucount == 0) {
				errno = EINTR;
				return (-1);
			}
			return accucount;
		default:
			/* bufsize > 0 still holds, so interruptible_write() will be called again. */
			break;
		}
		if(count < 0)
			count = 0;
	}
	p += count;
	bufsize -= (size_t) count; /* safe: count >= 0 here, and the write syscall ensures count <= bufsize */
	accucount += count; /* cannot overflow: the total never exceeds the initial bufsize <= SSIZE_MAX */
}
/* Left the loop because of an error other than EINTR, or because everything was written. */
if(count < 0 && accucount == 0)
	return (-1); /* errno as set by interruptible_write() */
return accucount;
For readv, there's the problem that the manual specifies it to be atomic. How this is supposed to be possible in the face of EINTR is anyone's guess; I suspect that the documentation is wrong.
The libc_intr_handler shall be settable by the user. Since it's thread-local, that should be no problem. Note that libc_intr_handler == NULL is invalid.
Introduce public routines to ensure that:
void set_libc_intr_handler(void* userdata, int (*alibc_intr_handler)(void* userdata))
{
	if(alibc_intr_handler == NULL)
		abort();
	libc_intr_handler = alibc_intr_handler;
	libc_intr_handler_userdata = userdata;
}

void get_libc_intr_handler(void** userdata, int (**alibc_intr_handler)(void* userdata))
{
	*userdata = libc_intr_handler_userdata;
	*alibc_intr_handler = libc_intr_handler;
}

/* static inline so that a header definition doesn't need an external definition elsewhere */
static inline int call_libc_intr_handler(void)
{
	return (*libc_intr_handler)(libc_intr_handler_userdata);
}
The main reason for the userdata argument is to be able to store the previous handler and chain handlers, if desired.
There are multiple possible default handlers; the one providing backwards compatibility (i.e. breaking as before) would be:
int libc_default_handle_intr(void* userdata) { return (-1); }
Then there should be a normal handler (unused by default) that provides immunity against unsolicited EINTR:
int libc_automatic_handle_intr(void* userdata) { return 0; }
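To illustrate the chaining that the userdata argument allows, here is a sketch of a handler that counts interruptions and then defers to whatever handler was installed before it, which it keeps in its userdata. The set/get/call functions mirror the proposal but are local stand-ins; struct counting_handler, push_counting_handler and pop_counting_handler are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Local stand-ins for the proposed variables and functions. */
static __thread void* libc_intr_handler_userdata;
static __thread int (*libc_intr_handler)(void* userdata);

static void set_libc_intr_handler(void* userdata, int (*handler)(void* userdata))
{
	if(handler == NULL)
		abort();
	libc_intr_handler = handler;
	libc_intr_handler_userdata = userdata;
}

static void get_libc_intr_handler(void** userdata, int (**handler)(void* userdata))
{
	*userdata = libc_intr_handler_userdata;
	*handler = libc_intr_handler;
}

static int call_libc_intr_handler(void)
{
	return (*libc_intr_handler)(libc_intr_handler_userdata);
}

static int libc_default_handle_intr(void* userdata)
{
	(void) userdata;
	return (-1);
}

/* Hypothetical chained handler: counts, then asks the previous handler. */
struct counting_handler {
	unsigned long count;
	void* prev_userdata;
	int (*prev_handler)(void* userdata);
};

static int counting_handle_intr(void* userdata)
{
	struct counting_handler* self = userdata;
	self->count++;
	return self->prev_handler(self->prev_userdata);
}

/* Installs self on top of the current handler. */
static void push_counting_handler(struct counting_handler* self)
{
	self->count = 0;
	get_libc_intr_handler(&self->prev_userdata, &self->prev_handler);
	set_libc_intr_handler(self, counting_handle_intr);
}

/* Restores the handler that was active before push_counting_handler(). */
static void pop_counting_handler(struct counting_handler* self)
{
	set_libc_intr_handler(self->prev_userdata, self->prev_handler);
}
```

Push and pop have to be properly nested, like a stack; the design keeps all bookkeeping in the chained handler's own userdata, so libc only ever stores one (handler, userdata) pair.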
The "nice" way would be to adapt only TEMP_FAILURE_RETRY to call it, since it's meant for that anyway. The disadvantage is that almost no one uses it, not even libio. Then, when a signal is sent, it is either silently ignored or it interrupts unexpectedly at places where TEMP_FAILURE_RETRY is not used. I don't think a whack-a-mole approach works in this case. Also, frankly, doing the right thing should be automatic.
The safe way is to adapt the low-level syscall wrappers themselves (and eventually, over time, yank out all the other higher-level handlers; no rush).
Affected Linux syscalls in glibc 2.19 (only the newest version of each is listed):
E futex
E semop
E semtimedop
E pause
f ptrace (FIXME)
E rt_sigsuspend
E T rt_sigtimedwait
E sigsuspend
E waitid
E waitpid
E creat
E open
E openat
E B read
E B write
close (DO NOT retry on EINTR)
E U pselect6
E dup
E dup2
E dup3
E T epoll_pwait
E T epoll_wait
E fallocate
E fcntl64
E flock
E ftruncate64
E truncate64
E T poll
E T ppoll
E T io_getevents (timeout not modified) (libaio) (only for the direct syscall! otherwise F)
E fstatfs64
E statfs64
E accept4
E connect
E B recv
E B recvfrom
E B recvmsg
E B send
E B sendmsg
E B sendto
E request_key
e t clock_nanosleep
E t nanosleep
E B t mq_timedreceive
E B t mq_timedsend
E B msgrcv
E B msgsnd
E B preadv
E B readv
E B pwritev
E B writev
E U newselect
E B getrandom

Meaning:
E ... returns (-1) and sets errno to EINTR.
e ... returns EINTR directly.
f ... returns (-1) whenever it feels like it, but sets errno on errors.
F ... returns less than min_nr on error.
B ... the buffer has to be adjusted.
T ... the timeout has to be adjusted. Note that the timeout is usually optional.
t ... the timeout has to be adjusted, but that's easy. Note that the timeout is usually optional.
U ... the timeout has to be adjusted, but "timeout is undefined". Note that the timeout is usually optional.
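For the T entries, the wrapper has to recompute the remaining timeout itself before retrying. A minimal sketch for poll(), assuming that a CLOCK_MONOTONIC deadline is acceptable and using local stand-ins for the proposed variables; intr_poll and monotonic_ms are hypothetical names. The t entries are easier because nanosleep, for example, already writes the remaining time into its second argument.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <time.h>

/* Local stand-ins for the proposed variables (no libc provides them). */
static __thread void* libc_intr_handler_userdata;
static __thread int (*libc_intr_handler)(void* userdata);

static int call_libc_intr_handler(void)
{
	return (*libc_intr_handler)(libc_intr_handler_userdata);
}

/* Milliseconds on the monotonic clock (truncated, so a retry may end up to ~1 ms early). */
static long long monotonic_ms(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* poll() with the proposed EINTR handling. A negative timeout means "wait forever"
   and needs no adjustment; otherwise the remaining time is recomputed after each EINTR. */
static int intr_poll(struct pollfd* fds, nfds_t nfds, int timeout)
{
	long long deadline = (timeout >= 0) ? monotonic_ms() + timeout : 0;
	int remaining = timeout;
	for(;;) {
		int result = poll(fds, nfds, remaining);
		if(result != -1 || errno != EINTR)
			return result;
		if(call_libc_intr_handler() == (-1)) {
			errno = EINTR;
			return (-1);
		}
		if(timeout >= 0) {
			long long left = deadline - monotonic_ms();
			remaining = (left > 0) ? (int) left : 0;
		}
	}
}
```

Keeping an absolute deadline instead of subtracting elapsed time per iteration avoids accumulating rounding errors when many signals arrive.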
Would it make sense to also patch the meta-level syscall() interface?
Probably, if gcc can be made to evaluate at compile time the check whether the given syscall can return EINTR.
Unchecked:
No manual entry for clock_adjtime
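pthread_create must set up libc_intr_handler and libc_intr_handler_userdata for the new thread (otherwise the new thread's libc_intr_handler starts out as NULL, which is invalid).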
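Until pthread_create does this itself, a user-level sketch could wrap it so that the new thread starts with its creator's handler. intr_pthread_create, the trampoline and probe_handler are hypothetical names; inheriting the creator's handler (rather than always installing the backwards-compatible default) is an assumption, not something the proposal specifies.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Local stand-ins for the proposed variables (no libc provides them). */
static __thread void* libc_intr_handler_userdata;
static __thread int (*libc_intr_handler)(void* userdata);

static int libc_default_handle_intr(void* userdata) { (void) userdata; return (-1); }
static int libc_automatic_handle_intr(void* userdata) { (void) userdata; return 0; }

/* Everything the new thread needs before it runs the real start routine. */
struct intr_thread_start {
	void* (*start_routine)(void* arg);
	void* arg;
	int (*handler)(void* userdata);
	void* handler_userdata;
};

static void* intr_thread_trampoline(void* p)
{
	struct intr_thread_start start = *(struct intr_thread_start*) p;
	free(p);
	/* Fresh __thread variables are zero-initialized; install the creator's handler. */
	libc_intr_handler = start.handler;
	libc_intr_handler_userdata = start.handler_userdata;
	return start.start_routine(start.arg);
}

/* Hypothetical: like pthread_create(), but the new thread inherits the creator's
   handler, falling back to the backwards-compatible default if none was set. */
static int intr_pthread_create(pthread_t* thread, const pthread_attr_t* attr,
                               void* (*start_routine)(void* arg), void* arg)
{
	struct intr_thread_start* start = malloc(sizeof *start);
	if(start == NULL)
		return EAGAIN;
	start->start_routine = start_routine;
	start->arg = arg;
	start->handler = (libc_intr_handler != NULL) ? libc_intr_handler : libc_default_handle_intr;
	start->handler_userdata = libc_intr_handler_userdata;
	int err = pthread_create(thread, attr, intr_thread_trampoline, start);
	if(err != 0)
		free(start);
	return err;
}

/* Example start routine: stores what this thread's handler returns. */
static void* probe_handler(void* arg)
{
	*(int*) arg = (*libc_intr_handler)(libc_intr_handler_userdata);
	return NULL;
}
```

Sharing the creator's userdata pointer means the handler must be safe to call from several threads at once.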
Some EINTR loops already exist (in sigtimedwait, pclose, thrd_sleep, pthread_rwlock_timedwrlock, pthread_mutex_timedlock, pthread_rwlock_timedrdlock, pthread_cond_timedwait, pthread_cancel, system and (close)), but they do not call the callback (obviously).
GLib's g_file_get_contents handles EINTR, but does not call the callback (obviously).
Many such loops are repeated all over the place, sometimes written with goto.
There needs to be a way to feature-test whether libc_intr_handler functionality is available or not.
#define INTR_HANDLER_CALLBACK_POINTER_EXISTS
or my personal favourite (really):
#define call_libc_intr_handler call_libc_intr_handler
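The self-referential #define lets client code test for the feature with a plain #ifdef on the function name. A sketch of how a library might use it, assuming the libc headers define that macro when the feature exists (read_some is a hypothetical name): with the feature, an EINTR that still reaches the caller is meant for it; without it, the traditional retry loop stays.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

static ssize_t read_some(int fd, void* buf, size_t len)
{
	ssize_t n;
#ifdef call_libc_intr_handler
	/* New libc: read() has already consulted the callback, so EINTR here is intentional. */
	n = read(fd, buf, len);
#else
	/* Old libc: keep the traditional retry loop. */
	do
		n = read(fd, buf, len);
	while(n == -1 && errno == EINTR);
#endif
	return n;
}
```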
It would be nice if there were a compile-time flag to find out whether a callback that does not always return (-1) is in use. I can't see how that would work, since the callback is only chosen at run time.
The variables libc_intr_handler and libc_intr_handler_userdata should be accessible but read-only to clients (for possible inlining benefits).
Adapt TEMP_FAILURE_RETRY to do this as well, so that existing libraries and applications use the new callback whether they know about it or not. With the new feature, TEMP_FAILURE_RETRY technically never has to retry: an EINTR that reaches it means the callback has already returned (-1). Before retrying, it calls the callback, which will again return (-1), so it won't retry.
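A sketch of the extended macro, modelled on glibc's TEMP_FAILURE_RETRY (a GNU statement expression) but under the hypothetical name INTR_TEMP_FAILURE_RETRY so that it can coexist with the real one. The handler variables are local stand-ins, and fake_syscall is a hypothetical test double that fails with EINTR a given number of times.

```c
#include <assert.h>
#include <errno.h>

/* Local stand-ins for the proposed variables and call function. */
static __thread void* libc_intr_handler_userdata;
static __thread int (*libc_intr_handler)(void* userdata);

static int call_libc_intr_handler(void)
{
	return (*libc_intr_handler)(libc_intr_handler_userdata);
}

/* Like TEMP_FAILURE_RETRY, but asks the callback before every retry. */
#define INTR_TEMP_FAILURE_RETRY(expression)                              \
	(__extension__ ({                                                    \
		long int intr_result_;                                           \
		for(;;) {                                                        \
			intr_result_ = (long int) (expression);                      \
			if(intr_result_ != -1L || errno != EINTR)                    \
				break;                                                   \
			if(call_libc_intr_handler() == (-1)) {                       \
				errno = EINTR; /* the callback may have changed errno */ \
				break;                                                   \
			}                                                            \
		}                                                                \
		intr_result_; }))

/* Hypothetical test double: fails with EINTR fake_eintr_times times, then returns 7. */
static int fake_calls;
static int fake_eintr_times;

static int fake_syscall(void)
{
	if(++fake_calls <= fake_eintr_times) {
		errno = EINTR;
		return (-1);
	}
	return 7;
}

static int retry_always(void* userdata) { (void) userdata; return 0; }
static int report_always(void* userdata) { (void) userdata; return (-1); }
```

With retry_always the macro behaves like today's TEMP_FAILURE_RETRY; with report_always it gives up after the first EINTR, which is what existing callers would see when the callback is sticky.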
Programs that handle EINTR today usually do so in a loop that just retries, without calling anything else. The new feature changes things so that an EINTR reaching the program is sometimes supposed to be a real failure (and sometimes not). Therefore, these programs should stop retrying blindly. If they have to retry, they can use the (extended) macro TEMP_FAILURE_RETRY instead of their own loop. If they don't change, things won't be better or worse than now.
What if our signal handler runs right before the interruptible syscall, setting a flag, and then we enter the interruptible syscall? We block, and the callback is never consulted because no EINTR occurs.
(Then the signal handler would have to make another signal pending, and the main program would have to make it unpending again. Ugh. This doesn't work, does it?)
Personally, I think EINTR was a mistake and BSD signal restart semantics are the correct way, more implementation complexity or not.
However, EINTR does have a use (detecting signals you actually wanted to interrupt you) and it's a reality we have to live with.
I spent several days tracking down a bug where EINTR was reported to the main program even though the signal was actually a SIGINT meant for the child process.
Then I spent several more days tracking down all the libraries where that is wrong (and started patching them).
Eventually I saw that almost no one is equipped to handle both kinds of EINTR, and that the right place to fix this (and especially, to put the callback variable) is the low-level interface: the glibc syscall wrappers. I think this was never meant to leak out into user programs that much. And if it does get passed to user programs, a user program that loops without first checking anything in its environment is... weird.
I know that this is a change in behaviour, but I think a routine reporting fewer error codes than before is not a problem when it's opt-in.
Q: Why not just stop installing signal handlers without SA_RESTART?
A: There are system calls that do not heed SA_RESTART, for example msgrcv. Also, sometimes you want to know that a signal happened, immediately.
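Also, the stop signals SIGSTOP, SIGTSTP, SIGTTIN and SIGTTOU are not optional: on Linux, some blocking calls can fail with EINTR after the process has been stopped and continued, even when no handler is installed (although many signals are restartable nowadays).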
It is (somewhat) possible to make performance worse only when EINTR is actually encountered. For example, the syscall wrapper could try the interruptible call immediately and handle EINTR only afterwards.
It is assumed that EINTR happens rarely, so the performance overhead when calling the callback (on EINTR) is not deemed significant.
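A sketch of that fast-path/slow-path split for dup (an E entry that needs no buffer or timeout adjustment), using local stand-ins for the proposed variables; intr_dup and dup_after_eintr are hypothetical names, and __builtin_expect and the cold attribute are GCC/Clang extensions.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Local stand-ins for the proposed variables (no libc provides them). */
static __thread void* libc_intr_handler_userdata;
static __thread int (*libc_intr_handler)(void* userdata);

/* Slow path, kept out of line: only entered after an actual EINTR. */
static __attribute__((noinline, cold)) int dup_after_eintr(int fd)
{
	for(;;) {
		if((*libc_intr_handler)(libc_intr_handler_userdata) == (-1)) {
			errno = EINTR;
			return (-1);
		}
		int result = dup(fd);
		if(result != -1 || errno != EINTR)
			return result;
	}
}

/* Fast path: one call plus one well-predicted branch when no EINTR happens,
   so the callback is never touched. */
static inline int intr_dup(int fd)
{
	int result = dup(fd);
	if(__builtin_expect(result == -1 && errno == EINTR, 0))
		return dup_after_eintr(fd);
	return result;
}
```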
While I can do the implementation work, it would be nice if some distribution maintainers could gather statistics on how many packages have manual EINTR checks and what it would take to use TEMP_FAILURE_RETRY instead, e.g. on the build servers:
grep -rl '\<EINTR\>' .
Use signalfd instead.
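A minimal Linux-only sketch of that alternative: block the signal and read it from a signalfd, so it is never delivered to a handler and cannot interrupt any syscall; the descriptor can also be waited on with poll together with other descriptors. open_signal_fd and wait_signal_fd are hypothetical names. In a multithreaded program the signal would have to be blocked in every thread (with pthread_sigmask, before creating threads) instead of using sigprocmask.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Blocks sig in the calling thread and returns a descriptor that becomes
   readable while sig is pending, or -1 on error. */
static int open_signal_fd(int sig)
{
	sigset_t mask;
	sigemptyset(&mask);
	sigaddset(&mask, sig);
	if(sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
		return (-1);
	return signalfd(-1, &mask, SFD_CLOEXEC);
}

/* Consumes the next pending signal from fd and returns its number, or -1 on error. */
static int wait_signal_fd(int fd)
{
	struct signalfd_siginfo info;
	ssize_t n = read(fd, &info, sizeof info);
	if(n != (ssize_t) sizeof info)
		return (-1);
	return (int) info.ssi_signo;
}
```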
All syscalls, except the following, are restarted if a signal whose handler was installed with SA_RESTART is received and the handler returns:
msgrcv(), msgsnd() (see msgop), semop(), semtimedop(), semwait, select(), poll(), epoll_wait(), sigtimedwait(), sigwaitinfo(), sleep(), pause(), sigsuspend().
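A small check of the poll() entry, assuming Linux, where poll is documented as never restarted: even with SA_RESTART, the interrupted poll() fails with EINTR. on_alarm and poll_interrupted_despite_sa_restart are hypothetical names, and the 20 ms timer is an arbitrary choice.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static void on_alarm(int sig)
{
	(void) sig; /* only present so that SIGALRM interrupts instead of terminating */
}

/* Installs on_alarm for SIGALRM with SA_RESTART, arms a one-shot 20 ms timer and
   blocks in poll() for up to 1 s. Returns poll()'s result (errno as poll() left it),
   or -2 if the setup failed. */
static int poll_interrupted_despite_sa_restart(void)
{
	struct sigaction sa;
	memset(&sa, 0, sizeof sa);
	sa.sa_handler = on_alarm;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	if(sigaction(SIGALRM, &sa, NULL) == -1)
		return (-2);
	struct itimerval timer;
	memset(&timer, 0, sizeof timer);
	timer.it_value.tv_usec = 20000; /* 20 ms, no repeat */
	if(setitimer(ITIMER_REAL, &timer, NULL) == -1)
		return (-2);
	return poll(NULL, 0, 1000);
}
```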
I, Danny Milosavljevic, put this document in the public domain.