1. The file descriptor that represents the read side of a pipe ( RFD ) is registered with the epoll device.
2. The pipe writer writes 2 kB of data on the write side of the pipe.
3. A call to epoll_wait(2) is done that will return RFD as a ready file descriptor.
4. The pipe reader reads 1 kB of data from RFD.
5. A call to epoll_wait(2) is done.
If the RFD file descriptor has been added to the epoll interface using the EPOLLET flag, the call to epoll_wait(2) done in step 5 will hang despite the available data still present in the file input buffer. The reason is that Edge Triggered event distribution delivers events only when the state of a monitored device changes from I/O space not available ( state 0 ) to I/O space available ( state 1 ). In the above example, the write done in step 2 generates an event on RFD ( supposing that the pipe read buffer was empty before ), and that event is consumed in step 3. Since the read operation done in step 4 does not consume the whole buffer ( that is, the condition remains I/O space available ), a 0 -> 1 transition cannot happen in step 5.
The epoll interface, when used with the EPOLLET flag ( Edge Triggered ), should use non-blocking file descriptors so that a blocking read or write does not starve a task that is handling multiple file descriptors. The suggested way to use epoll as an Edge Triggered ( EPOLLET ) interface is given below; possible pitfalls and ways to avoid them follow later.
o with non-blocking file descriptors, and
o by waiting for an event only after read(2) or write(2) returns EAGAIN.
By contrast, when used as a Level Triggered interface, epoll is simply a faster poll(2), and can be used wherever the latter is used since it shares the same semantics.
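The edge-triggered pipe scenario and the read-until-EAGAIN pattern can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names drain_fd and et_pipe_demo are made up for the example, setup failures are handled with assert(3) for brevity, and a 100 ms timeout stands in for the indefinite hang of step 5.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Read a non-blocking fd until read(2) reports EAGAIN (or EOF).
 * Returns the number of bytes consumed, or -1 on any other error. */
long drain_fd(int fd)
{
    char buf[512];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0)
            total += n;
        else if (n == 0)
            return total;                       /* EOF */
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                       /* buffer is empty */
        else if (errno != EINTR)
            return -1;
    }
}

/* Replays steps 1 to 5 with EPOLLET. Returns what the second
 * epoll_wait(2) returned (0 means it timed out) and stores in *left
 * the number of bytes that were still sitting in the pipe. */
int et_pipe_demo(long *left)
{
    int p[2];
    char data[2048] = {0}, half[1024];
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET }, out;

    assert(left != NULL);
    int rc = pipe(p);
    assert(rc == 0);
    rc = fcntl(p[0], F_SETFL, O_NONBLOCK);
    assert(rc == 0);
    int ep = epoll_create1(0);
    assert(ep != -1);

    ev.data.fd = p[0];                              /* step 1 */
    rc = epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev);
    assert(rc == 0);

    ssize_t n = write(p[1], data, sizeof data);     /* step 2 */
    assert(n == (ssize_t)sizeof data);

    rc = epoll_wait(ep, &out, 1, 1000);             /* step 3 */
    assert(rc == 1 && out.data.fd == p[0]);

    n = read(p[0], half, sizeof half);              /* step 4 */
    assert(n == (ssize_t)sizeof half);

    /* step 5: no new 0 -> 1 transition, so this is expected to time out */
    int second = epoll_wait(ep, &out, 1, 100);

    /* The suggested pattern: keep reading until EAGAIN. */
    *left = drain_fd(p[0]);

    close(ep);
    close(p[0]);
    close(p[1]);
    return second;
}
```

Calling et_pipe_demo(&left) is expected to return 0 and leave 1024 in left: the second wait reports nothing although half of the data is still unread.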
What happens if you add the same fd to an epoll_set twice?
You will probably get EEXIST. However, it is possible that two threads may add the same fd twice. This is a harmless condition.
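A small single-threaded check of the EEXIST case might look like the sketch below. The function name add_twice_errno is hypothetical, and assert(3) is used only to keep the setup short.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read end of a pipe twice in the same epoll set.
 * Returns the errno of the second EPOLL_CTL_ADD, or 0 if it succeeded. */
int add_twice_errno(void)
{
    int p[2];
    struct epoll_event ev = { .events = EPOLLIN };

    int rc = pipe(p);
    assert(rc == 0);
    int ep = epoll_create1(0);
    assert(ep != -1);

    ev.data.fd = p[0];
    rc = epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev);
    assert(rc == 0);

    /* Second registration of the same fd. */
    rc = epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev);
    int err = (rc == -1) ? errno : 0;

    close(ep);
    close(p[0]);
    close(p[1]);
    return err;
}
```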
Can two epoll sets wait for the same fd? If so, are events reported to both epoll sets fds?
Yes, and events would be reported to both. However, doing so is not recommended.
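To see both sets being notified, one could register the same pipe read end in two epoll sets, as in this sketch (two_sets_reporting is an illustrative name; Level Triggered mode is assumed):

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Watches one pipe read end from two epoll sets and returns how many
 * of the sets reported it as readable after a single write. */
int two_sets_reporting(void)
{
    int p[2], ep[2], reported = 0;
    struct epoll_event ev = { .events = EPOLLIN }, out;

    int rc = pipe(p);
    assert(rc == 0);
    ev.data.fd = p[0];
    for (int i = 0; i < 2; i++) {
        ep[i] = epoll_create1(0);
        assert(ep[i] != -1);
        rc = epoll_ctl(ep[i], EPOLL_CTL_ADD, p[0], &ev);
        assert(rc == 0);
    }

    ssize_t n = write(p[1], "x", 1);
    assert(n == 1);

    for (int i = 0; i < 2; i++) {
        if (epoll_wait(ep[i], &out, 1, 1000) == 1 && out.data.fd == p[0])
            reported++;
        close(ep[i]);
    }
    close(p[0]);
    close(p[1]);
    return reported;
}
```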
Is the epoll fd itself poll/epoll/selectable?
Yes.
What happens if the epoll fd is put into its own fd set?
It will fail. However, you can add an epoll fd inside another epoll fd set.
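Both halves of this answer can be tried out with the sketch below. The name nesting_demo is hypothetical; the error observed for the self-registration is stored rather than assumed, although epoll_ctl(2) documents EINVAL for this case.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Tries to add an epoll fd to itself (errno stored in *self_errno),
 * then nests it inside another epoll fd. Returns 1 if the outer set
 * reported the inner one once the inner set had a pending event. */
int nesting_demo(int *self_errno)
{
    int p[2];
    struct epoll_event ev = { .events = EPOLLIN }, out;

    assert(self_errno != NULL);
    int rc = pipe(p);
    assert(rc == 0);
    int inner = epoll_create1(0), outer = epoll_create1(0);
    assert(inner != -1 && outer != -1);

    /* An epoll fd cannot watch itself. */
    ev.data.fd = inner;
    rc = epoll_ctl(inner, EPOLL_CTL_ADD, inner, &ev);
    *self_errno = (rc == -1) ? errno : 0;

    /* It can be watched by a different epoll fd. */
    rc = epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);
    assert(rc == 0);
    ev.data.fd = p[0];
    rc = epoll_ctl(inner, EPOLL_CTL_ADD, p[0], &ev);
    assert(rc == 0);

    ssize_t n = write(p[1], "x", 1);
    assert(n == 1);

    int got = epoll_wait(outer, &out, 1, 1000);
    int ok = (got == 1 && out.data.fd == inner);

    close(outer);
    close(inner);
    close(p[0]);
    close(p[1]);
    return ok;
}
```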
Can I send the epoll fd over a unix-socket to another process?
Will the close of an fd cause it to be removed from all epoll sets automatically?
If more than one event comes in between epoll_wait(2) calls, are they combined or reported separately?
They will be combined.
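As an illustration of combining, the sketch below watches one end of a socket pair for both input and output. Once the peer has written a byte, a single epoll_wait(2) call should hand back one entry carrying both bits. The name combined_events is made up for the example.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the number of entries reported by one epoll_wait(2) call and
 * stores the event mask of the first entry in *mask. */
int combined_events(unsigned int *mask)
{
    int sv[2];
    struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT }, out[4];

    assert(mask != NULL);
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    int ep = epoll_create1(0);
    assert(ep != -1);

    ev.data.fd = sv[0];
    rc = epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev);
    assert(rc == 0);

    /* sv[0] is already writable; this makes it readable as well. */
    ssize_t n = write(sv[1], "x", 1);
    assert(n == 1);

    int got = epoll_wait(ep, out, 4, 1000);
    *mask = (got > 0) ? out[0].events : 0;

    close(ep);
    close(sv[0]);
    close(sv[1]);
    return got;
}
```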
Does an operation on an fd affect the already collected but not yet reported events?
Two operations are possible on an existing fd: remove and modify. Remove would be meaningless in this case. Modify will re-read the available I/O.
Do I need to continuously read/write an fd until EAGAIN when using the EPOLLET flag ( Edge Triggered behaviour ) ?
No, you don't. Receiving an event from epoll_wait(2) tells you that the file descriptor is ready for the requested I/O operation. You simply have to consider it ready until you receive the next EAGAIN. When and how you use the file descriptor is entirely up to you. Also, you can detect that the read/write I/O space is exhausted by checking the amount of data read from or written to the target file descriptor. For example, if you call read(2) asking for a certain amount of data and read(2) returns fewer bytes, you can be sure that you have exhausted the read I/O space for that file descriptor. The same holds when writing with write(2).
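The short-read check described in this answer could be wrapped in a helper like the one below. This is only a sketch: read_chunk and its exhausted flag are hypothetical names, and the caller still has to inspect errno when the return value is -1.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Reads up to `want` bytes from fd. Sets *exhausted to 1 when read(2)
 * returned fewer bytes than requested (including 0 for EOF), which the
 * text above treats as "read I/O space exhausted". On error (-1) the
 * flag is left at 0 and errno tells the caller what happened. */
ssize_t read_chunk(int fd, void *buf, size_t want, int *exhausted)
{
    assert(buf != NULL && exhausted != NULL);

    ssize_t n = read(fd, buf, want);
    *exhausted = (n >= 0 && (size_t)n < want);
    return n;
}
```

A caller that sees the flag set can go back to epoll_wait(2) without issuing one more read(2) just to collect EAGAIN.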
It is possible that while reading (assuming you are reading in a loop waiting for EAGAIN), more I/O comes in for a second event. That I/O will be read immediately. However, the next time you call epoll_wait(2) on that fd, it will say there is an event ready even though the I/O for it has already been consumed.
A certain amount of data arrives on a monitored file descriptor.
An epoll_wait(2) call returns with the above file descriptor signaled.
Another chunk of data arrives on the same file descriptor.
The file descriptor is internally signaled as ready.
A call to read(2) consumes all the available data.
Another call to epoll_wait(2) will return the above file descriptor even though no data is available, so the next call to read(2) returns EAGAIN.
In the case of non-blocking file descriptors, this will result in the next call to read immediately returning with EAGAIN. In the case of blocking file descriptors, you will hang waiting to read I/O that is not there. The author does not recommend using blocking file descriptors together with the Edge Triggered behaviour, but will not stop you.
One way to handle this is to mark the file descriptor as ready in its associated data structure after the first event is received, then ignore other events while it is in the ready state. Once you have read until receiving EAGAIN, clear the ready state bit before calling epoll_wait(2) again on that fd.
o Starvation ( Edge Triggered )
If there is a large amount of I/O space, it is possible that trying to drain it will keep the other files from being processed, causing starvation. This is not specific to epoll.
The solution is to maintain a ready list and mark the file descriptor as ready in its associated data structure, thereby allowing the application to remember which files need to be processed but still round robin amongst all the ready files. This also supports ignoring subsequent events you receive for fds that are already ready.
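One possible shape for such a ready list is sketched below. All names (struct source, struct ready_list, mark_ready, serve_one, CHUNK) are invented for the illustration, and the fds are assumed to be non-blocking. Each turn reads at most one chunk, then moves the source to the back of the list, so a busy fd cannot monopolize the loop.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define CHUNK 256                 /* most bytes served per turn */

struct source {
    int fd;
    int ready;                    /* 1 while queued on the ready list */
    long total;                   /* bytes read so far */
    struct source *next;
};

struct ready_list { struct source *head, *tail; };

int set_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL);
    return fl == -1 ? -1 : fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

/* Call for every event from epoll_wait(2); duplicates are ignored. */
void mark_ready(struct ready_list *rl, struct source *s)
{
    assert(rl != NULL && s != NULL);
    if (s->ready)
        return;
    s->ready = 1;
    s->next = NULL;
    if (rl->tail)
        rl->tail->next = s;
    else
        rl->head = s;
    rl->tail = s;
}

/* Serves the source at the head for one chunk. Returns 0 when the
 * list is empty, 1 otherwise. */
int serve_one(struct ready_list *rl)
{
    char buf[CHUNK];
    struct source *s = rl->head;

    if (s == NULL)
        return 0;
    rl->head = s->next;
    if (rl->head == NULL)
        rl->tail = NULL;
    s->ready = 0;

    ssize_t n = read(s->fd, buf, sizeof buf);
    if (n > 0) {
        s->total += n;
        mark_ready(rl, s);        /* may hold more: back of the queue */
    } else if (n == -1 && errno == EINTR) {
        mark_ready(rl, s);        /* interrupted: try again later */
    }
    /* EOF, EAGAIN or an error: stays off the list until the next event */
    return 1;
}
```

A main loop would call mark_ready() for each event returned by epoll_wait(2) and then call serve_one() a bounded number of times before waiting again.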
o If using an event cache...
If you use an event cache or store all the fds returned from epoll_wait(2), then make sure to provide a way to mark an fd's closure dynamically (i.e., caused by a previous event's processing). Suppose you receive 100 events from epoll_wait(2), and in event #47 a condition causes event #13 to be closed. If you remove the structure and close() the fd for event #13, then your event cache might still say there are events waiting for that fd, causing confusion.
One solution for this is to call, during the processing of event 47, epoll_ctl(EPOLL_CTL_DEL) to delete fd 13 and close(), then mark its associated data structure as removed and link it to a cleanup list. If you find another event for fd 13 in your batch processing, you will discover the fd had been previously removed and there will be no confusion.
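A rough sketch of that removed flag and cleanup list follows. The struct conn layout, the peer field (used here only to simulate "handling one event closes another fd") and the function names are all assumptions made for the example; records are assumed to be heap-allocated and stored in the data.ptr member of each event.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-fd record; events carry it in data.ptr. */
struct conn {
    int fd;
    int removed;                 /* set once the fd is deleted and closed */
    struct conn *peer;           /* illustrative: closed when we are handled */
    struct conn *next_cleanup;
};

/* Delete the fd from the epoll set, close it, and defer the free. */
void conn_remove(int ep, struct conn *c, struct conn **cleanup)
{
    if (c->removed)
        return;
    epoll_ctl(ep, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    c->removed = 1;
    c->next_cleanup = *cleanup;
    *cleanup = c;
}

/* Walks one batch, skipping records removed earlier in the same batch.
 * Returns the number of events that were actually handled. */
int process_batch(int ep, struct epoll_event *evs, int n,
                  struct conn **cleanup)
{
    int handled = 0;

    assert(evs != NULL && cleanup != NULL);
    for (int i = 0; i < n; i++) {
        struct conn *c = evs[i].data.ptr;
        if (c->removed)
            continue;            /* stale cached event */
        handled++;
        /* Stand-in for real work: handling c closes its peer. */
        if (c->peer)
            conn_remove(ep, c->peer, cleanup);
    }
    return handled;
}

/* Frees the records only after the whole batch has been processed. */
int free_cleanup(struct conn **cleanup)
{
    int freed = 0;

    while (*cleanup) {
        struct conn *c = *cleanup;
        *cleanup = c->next_cleanup;
        free(c);
        freed++;
    }
    return freed;
}
```

Deferring the free until the batch is finished is what keeps the stale entry for the closed fd safe to inspect.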