The Linux epoll Model (1): Theory and Practice

Date: June 20, 2015

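Introduction:

Compared with select, the biggest advantage of epoll is that its efficiency does not fall as the number of monitored fds grows. The kernel implementation of select works by polling, so the more fds it polls, the longer each call takes. In addition, the header linux/posix_types.h contains this declaration: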

#define __FD_SETSIZE 1024

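This means select can monitor at most 1024 fds at a time. The number can be raised by editing the header and recompiling, but that treats the symptom rather than the cause.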
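To check the limit on a particular system, something like the following can be used. It is only a sketch: `select_fd_limit` and `fits_in_fd_set` are names made up for illustration. Note that the bound applies to the descriptor's numeric value, not just to how many descriptors are watched.

```c
#include <sys/select.h>

/* Returns the compile-time capacity of an fd_set. Hypothetical helper,
 * not part of any library. */
int select_fd_limit(void)
{
    return FD_SETSIZE;
}

/* Returns 1 if fd may be passed to FD_SET(), 0 otherwise. Passing a
 * descriptor >= FD_SETSIZE to FD_SET() is undefined behaviour, so the
 * check has to happen before the macro is used. */
int fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}
```

A server built on select would call `fits_in_fd_set()` on every accepted socket and refuse (or close) those that do not fit.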

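Characteristics of the common models

When designing concurrent network programs on Linux, the typical choices are the Apache model (Process Per Connection, PPC), the TPC (Thread Per Connection) model, the select model, and the poll model.

1. PPC/TPC models

These two models share one idea: hand each incoming connection off to do its own work and stop bothering the main loop (see this blog post: http://blog.csdn.net/zjf280441589/article/details/41685103). PPC starts a process for the connection, while TPC starts a thread. This hands-off approach has a cost in both time and memory: once there are many connections, switching among that many processes or threads becomes expensive. As a result, the maximum number of connections these models can accept is not high, generally a few hundred.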

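2. The select model

1. Limited maximum concurrency. The number of FDs (file descriptors) that select can watch in one process is bounded by FD_SETSIZE, whose default value is 1024/2048, so the maximum concurrency of the select model is bounded accordingly. Could you simply raise FD_SETSIZE yourself? It is a nice idea, but consider the next point first.

2. Efficiency. Every call to select linearly scans the entire FD set, so the cost grows linearly with the number of FDs. Raising FD_SETSIZE therefore just makes every call slower, to the point where connections start timing out.

3. Copying between kernel and user space. How does the kernel pass FD notifications to user space? select does it by copying memory: the FD sets are copied in and out on every call.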

3. The poll model

Its efficiency is essentially the same as that of select; it fixes neither drawback 2 nor drawback 3 of select.
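The linear scan that poll shares with select can be made visible with a small sketch. `scan_poll_results` is a hypothetical helper written for this illustration; it counts how many array entries the application has to examine after poll(2) returns.

```c
#include <poll.h>
#include <unistd.h>

/* Polls nfds entries and returns how many entries had to be examined
 * to find the ready ones (-1 if a read fails). poll()'s own return
 * value is stored in *ready. */
int scan_poll_results(struct pollfd *fds, nfds_t nfds, int timeout_ms, int *ready)
{
    int examined = 0;

    *ready = poll(fds, nfds, timeout_ms);
    if (*ready <= 0)
        return 0;                   /* timeout or error: nothing to scan */

    for (nfds_t i = 0; i < nfds; i++) {
        examined++;                 /* every entry is looked at ... */
        if (fds[i].revents & POLLIN) {
            char c;                 /* ... but only ready ones are served */
            if (read(fds[i].fd, &c, 1) < 0)
                return -1;
        }
    }
    return examined;
}
```

With two watched pipes of which only one has data, poll reports one ready descriptor, yet the loop still visits both entries. That per-call cost is what grows with the number of connections.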

(Main sources: [1] http://hzphust.blog.163.com/blog/static/19699017220111091465244/ [2] APUE)

First, how does the man page introduce epoll?

NAME

epoll - I/O event notification facility

SYNOPSIS

#include <sys/epoll.h>

DESCRIPTION

epoll is a variant of poll(2) that can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors. The following system calls are provided to create and manage an epoll instance:

* An epoll instance created by epoll_create(2), which returns a file descriptor referring to the epoll instance. (The more recent epoll_create1(2) extends the functionality of epoll_create(2).)

* Interest in particular file descriptors is then registered via epoll_ctl(2). The set of file descriptors currently registered on an epoll instance is sometimes called an epoll set.

* Finally, the actual wait is started by epoll_wait(2).

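What epoll improves

1. epoll has no hard cap on concurrent connections. The upper bound is the maximum number of files that can be opened, which is generally far larger than 2048. (This limit can be changed by a command or a system call, but doing so is not recommended, for the reason given below.) The number depends heavily on system memory; the exact value can be checked with cat /proc/sys/fs/file-max (which printed 599534 here). [If you need to handle tens of thousands of concurrent connections, though, a multi-process design is still preferable: it makes full use of the CPUs and also helps keep the system as a whole stable.]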
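Besides the system-wide file-max value, each process also has its own descriptor limit (RLIMIT_NOFILE, the number shown by ulimit -n), and that is usually the first ceiling an epoll server runs into. A minimal sketch for reading it follows; `open_file_soft_limit` is an illustrative name, not an existing API.

```c
#include <limits.h>
#include <sys/resource.h>

/* Returns the soft limit on open descriptors for the calling process,
 * LONG_MAX if the limit is unlimited or too large for a long, or -1
 * if getrlimit() fails. */
long open_file_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > (rlim_t)LONG_MAX)
        return LONG_MAX;
    return (long)rl.rlim_cur;
}
```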

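2. Better efficiency. The greatest strength of epoll is that it only deals with "active" connections: each call returns just a list of concrete events, independent of the total number of connections. In real network environments, where most connections are idle at any given moment, epoll is therefore far more efficient than select and poll.

3. Memory copying. It is often said that epoll uses mmap to map one block of physical memory into both kernel and user space so that no data needs to be copied. The mainline Linux implementation does not do this: epoll_wait still copies the ready events into the caller's array. The saving comes from registering the interest set once with epoll_ctl instead of passing the whole FD set on every call, and from copying back only the events that are ready.

4. epoll can benefit from kernel tuning (thanks to the tuning facilities that Linux provides).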

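Why is epoll efficient?

The efficiency of epoll is inseparable from the design of its data structures.

First recall the select model. When an I/O event arrives, select tells the application that something is ready, and the application must then walk the entire fd set, testing each fd for an event and handling it.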

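struct timeval tv = { 120, 0 };   /* timeout; assumes 120 meant seconds */
int res = select(maxfd + 1, &readfds, NULL, NULL, &tv);
if (res > 0)
{
    /* select only says "something is ready": test every connection */
    for (int i = 0; i < MAX_CONNECTION; i++)
    {
        if (FD_ISSET(allConnection[i], &readfds))
        {
            handleEvent(allConnection[i]);
        }
    }
}
// if (res == 0) handle timeout; if (res < 0) handle error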

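epoll not only tells the application that I/O events have arrived, it also hands back information associated with each event. That information was filled in by the application itself at registration time, so the application can go straight to the source of each event without traversing the whole FD set.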

int res = epoll_wait(epfd, events, 20, 120);   /* up to 20 events, 120 ms timeout */
for (int i = 0; i < res; i++)
{
    handleEvent(events[i]);
}
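To make the contrast concrete, the sketch below registers several pipes, makes exactly one of them readable, and checks what epoll_wait hands back. The function name, `N_PIPES`, and the choice to store an index in `data.u32` are all assumptions made for this illustration.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

#define N_PIPES 8

/* Registers N_PIPES pipe read ends, writes one byte into pipe number
 * `active`, and returns how many events epoll_wait reports. The index
 * carried by the reported event is stored in *ready_index. Returns -1
 * on setup failure. */
int epoll_reports_only_ready(int active, int *ready_index)
{
    int pipes[N_PIPES][2];
    struct epoll_event ev = {0}, events[N_PIPES];
    int created = 0, n = -1;
    int epfd = epoll_create1(0);

    *ready_index = -1;
    if (epfd == -1 || active < 0 || active >= N_PIPES)
        goto out;

    for (; created < N_PIPES; created++) {
        if (pipe(pipes[created]) == -1)
            goto out;
        ev.events = EPOLLIN;
        ev.data.u32 = (uint32_t)created;    /* user data: our own index */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipes[created][0], &ev) == -1) {
            close(pipes[created][0]);
            close(pipes[created][1]);
            goto out;
        }
    }
    if (write(pipes[active][1], "x", 1) != 1)
        goto out;

    n = epoll_wait(epfd, events, N_PIPES, 0);   /* 0 ms: do not block */
    if (n == 1)
        *ready_index = (int)events[0].data.u32;
out:
    while (created-- > 0) {
        close(pipes[created][0]);
        close(pipes[created][1]);
    }
    if (epfd != -1)
        close(epfd);
    return n;
}
```

Eight descriptors are watched, but the returned array contains a single entry, and that entry already identifies which pipe fired. No scan over the other seven is needed.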

Key epoll data structures

As noted above, the speed of epoll is inseparable from its data structures. The key ones are:

typedef union epoll_data
{
    void *ptr;
    int fd;
    __uint32_t u32;
    __uint64_t u64;
} epoll_data_t;

struct epoll_event
{
    __uint32_t events;      // epoll events
    epoll_data_t data;      // User data variable
};

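In struct epoll_event, the field we care about most is events. It selects which I/O events to watch and which working mode to use (LT/ET, described below). Several I/O events and mode flags can be set at once by combining them with the "|" operator.

The data field is private to the user. epoll_data is a union, through which the application can store one of several kinds of information, such as an fd or a pointer; epoll never touches this data and simply returns it with the event. Its purpose is to carry private data that makes it easier to organize state when handling the event. In the vast majority of cases, the I/O handle being monitored is stored here.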
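When more than the fd is needed per connection, the pointer member can carry a whole per-connection record. The following is a minimal sketch; `struct conn`, `watch_conn`, and `next_ready_conn` are hypothetical names. Because data is a union, storing a pointer means the fd has to live inside the record rather than in data.fd.

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-connection state kept by the application. */
struct conn {
    int    fd;
    size_t bytes_seen;
};

/* Registers c->fd for readability and attaches c as the user data. */
int watch_conn(int epfd, struct conn *c)
{
    struct epoll_event ev = {0};

    ev.events = EPOLLIN;
    ev.data.ptr = c;            /* returned untouched by epoll_wait */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, c->fd, &ev);
}

/* Waits for a single event and returns the record it belongs to,
 * or NULL on timeout or error. */
struct conn *next_ready_conn(int epfd, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);

    return n == 1 ? (struct conn *)ev.data.ptr : NULL;
}
```

The record must stay alive for as long as the fd is registered, since epoll only stores the pointer value.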

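The epoll API

The epoll interface is very simple, just three functions:

1. int epoll_create(int size);

Creates an epoll handle. size was meant to tell the kernel roughly how many fds would be monitored, but the parameter has long been obsolete and exists only for backward compatibility, so its value is arbitrary. 256 is a common suggestion, though it has no real effect on performance. This parameter is unlike the first argument of select(), which gives the highest monitored fd plus 1. Note that once created, the epoll handle itself occupies an fd; on Linux you can see it under /proc/<pid>/fd/. After you are done with epoll you must therefore call close() on it, otherwise fds may eventually be exhausted. [size does not represent the maximum number of handles epoll supports. It hinted at how much storage the kernel should set aside for handles: when more handles were added and storage ran out, the kernel grew it in steps of size. Since Linux 2.6.8 the value is ignored, but it must still be greater than 0.]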
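The man page excerpt earlier mentions epoll_create1(2), which drops the size argument. The sketch below shows the create/close pairing with that call, and confirms that the legacy call rejects a non-positive size; both function names are made up for this example.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Creates an epoll instance (close-on-exec) and releases it again.
 * Returns 0 on success, -1 on failure. */
int open_and_close_epoll(void)
{
    int epfd = epoll_create1(EPOLL_CLOEXEC);

    if (epfd == -1)
        return -1;
    /* ... epoll_ctl() / epoll_wait() would go here ... */
    return close(epfd);         /* the instance is an fd like any other */
}

/* Calls the legacy epoll_create() with size 0 and returns the errno it
 * fails with, or 0 if the call unexpectedly succeeds. */
int legacy_size_zero_errno(void)
{
    int epfd = epoll_create(0);

    if (epfd != -1) {
        close(epfd);
        return 0;
    }
    return errno;
}
```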

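2. int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

The event registration function for epoll. Unlike select(), where you tell the kernel what kinds of events to watch at the moment you wait, here you register the event types in advance. It controls the events on an epoll file descriptor: register, modify, delete. In that sense it plays the role of the FD_SET and FD_CLR macros in the select model.

Parameters:

epfd is the return value of epoll_create(), the file descriptor dedicated to the epoll instance.

op specifies the action, expressed by one of three macros:

EPOLL_CTL_ADD: register a new fd with epfd;

EPOLL_CTL_MOD: change the events monitored for an already registered fd;

EPOLL_CTL_DEL: remove an fd from epfd;

fd is the fd to be monitored;

event tells the kernel which events to monitor. struct epoll_event looks like this: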

struct epoll_event {
  __uint32_t events;  /* epoll events */
  epoll_data_t data;  /* User data variable */
};

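events can be a combination of the following macros:

EPOLLIN: the file descriptor is readable (this includes the peer socket closing normally);

EPOLLOUT: the file descriptor is writable;

EPOLLPRI: the file descriptor has urgent data to read (typically the arrival of out-of-band data);

EPOLLERR: an error occurred on the file descriptor;

EPOLLHUP: the file descriptor was hung up;

EPOLLET: puts epoll into edge-triggered (Edge Triggered) mode for this fd, as opposed to level-triggered (Level Triggered) mode;

EPOLLONESHOT: report only one event. After that event has been delivered, the socket must be re-armed with epoll_ctl() and EPOLL_CTL_MOD if you want to keep monitoring it.

(The ones used most often are EPOLLIN, EPOLLOUT, and EPOLLET.)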
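The three op values and their typical failure modes can be exercised on a pipe. This is a sketch under the assumption that a pipe read end stands in for a socket; `try_ctl` and `ctl_lifecycle` are illustrative names.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Applies op to fd; returns 0 on success or the errno value on failure. */
static int try_ctl(int epfd, int op, int fd, uint32_t events)
{
    struct epoll_event ev = {0};

    ev.events = events;
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}

/* Runs ADD, ADD, MOD, DEL, MOD against one pipe read end and records
 * each outcome in results[]. Returns 0 if the setup succeeded. */
int ctl_lifecycle(int results[5])
{
    int p[2];
    int epfd = epoll_create1(0);

    if (epfd == -1)
        return -1;
    if (pipe(p) == -1) {
        close(epfd);
        return -1;
    }

    results[0] = try_ctl(epfd, EPOLL_CTL_ADD, p[0], EPOLLIN);           /* register      */
    results[1] = try_ctl(epfd, EPOLL_CTL_ADD, p[0], EPOLLIN);           /* already there */
    results[2] = try_ctl(epfd, EPOLL_CTL_MOD, p[0], EPOLLIN | EPOLLET); /* change flags  */
    results[3] = try_ctl(epfd, EPOLL_CTL_DEL, p[0], 0);                 /* unregister    */
    results[4] = try_ctl(epfd, EPOLL_CTL_MOD, p[0], EPOLLIN);           /* no longer set */

    close(p[0]);
    close(p[1]);
    close(epfd);
    return 0;
}
```

Registering the same fd twice fails with EEXIST, and modifying an fd that is not registered fails with ENOENT, as documented in epoll_ctl(2).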

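3. int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

Waits for I/O events to occur.

Parameters:

epfd: the epoll file descriptor produced by epoll_create();

events: the array used to return pending events, i.e. the list of events that have occurred. The caller normally iterates over this array and handles each event in turn. The array must be allocated in advance; its size (strictly speaking, its number of elements) depends on the situation;

maxevents: the maximum number of events that one call may return [it must be greater than 0 and no larger than the size of events, otherwise memory will be accessed out of bounds];

timeout: how long to wait for I/O events, in milliseconds. If a dedicated epoll thread has been created, timeout can be set to -1, meaning wait forever;

Return value:

The number of I/O handles on which events occurred. A return value of 0 means the call timed out; -1 indicates an error, with errno set.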

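epoll working modes:

Edge Triggered (ET): fires only when new data arrives, regardless of whether data is still sitting in the buffer.

Level Triggered (LT): fires as long as there is data.

Detailed explanation:

LT (level-triggered) is the default mode and supports both blocking and non-blocking sockets. In this mode the kernel tells you whether a file descriptor is ready, and you can then perform I/O on it. If you do nothing, the kernel keeps notifying you, so code written in this mode is somewhat less error-prone. Traditional select/poll are representatives of this model.

ET (edge-triggered) is the high-speed mode and should be used only with non-blocking sockets. In this mode the kernel notifies you through epoll when a descriptor goes from not ready to ready. It then assumes you know the descriptor is ready and sends no further readiness notifications for it until you do something that makes it not ready again (for example, a send, receive, or accept that returns EWOULDBLOCK, or that transfers less than the requested amount). Note that if you never perform I/O on the fd (so that it never becomes not ready again), the kernel sends no more notifications (only once). Whether ET mode actually speeds things up for TCP still needs more benchmarks to confirm (the author notes not understanding this last remark).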
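In practice, ET mode means that a handler has to keep reading until the kernel reports that nothing is left. One possible shape for such a loop is sketched below; `drain_fd` is a made-up name, and the choice to reject blocking descriptors outright is a design decision of this sketch, not something epoll requires.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Reads from a non-blocking fd until read() reports EAGAIN or EOF.
 * Returns the number of bytes consumed, or -1 on error. A blocking fd
 * is refused with EINVAL, because the final read() would hang. */
long drain_fd(int fd)
{
    char buf[512];
    long total = 0;
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    if (!(flags & O_NONBLOCK)) {
        errno = EINVAL;
        return -1;
    }

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);

        if (n > 0) {
            total += n;             /* real code would process buf here */
            continue;
        }
        if (n == 0)
            return total;           /* EOF: the peer closed */
        if (errno == EINTR)
            continue;               /* interrupted: just retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;           /* drained: safe to epoll_wait again */
        return -1;
    }
}
```

Only after this function returns because of EAGAIN is it safe to go back to epoll_wait in ET mode, since no new edge will be reported for data that was already buffered.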

The explanation in the man page:

Level-Triggered and Edge-Triggered

The epoll event distribution interface is able to behave both as edge-triggered (ET) and as level-triggered (LT). The difference between the two mechanisms can be described as follows. Suppose that this scenario happens:

1. The file descriptor that represents the read side of a pipe (rfd) is registered on the epoll instance.

2. A pipe writer writes 2 kB of data on the write side of the pipe.

3. A call to epoll_wait(2) is done that will return rfd as a ready file descriptor.

4. The pipe reader reads 1 kB of data from rfd.

5. A call to epoll_wait(2) is done.

If the rfd file descriptor has been added to the epoll interface using the EPOLLET (edge-triggered) flag, the call to epoll_wait(2) done in step 5 will probably hang despite the available data still present in the file input buffer; meanwhile the remote peer might be expecting a response based on the data it already sent. The reason for this is that edge-triggered mode only delivers events when changes occur on the monitored file descriptor. So, in step 5 the caller might end up waiting for some data that is already present inside the input buffer. In the above example, an event on rfd will be generated because of the write done in 2 and the event is consumed in 3. Since the read operation done in 4 does not consume the whole buffer data, the call to epoll_wait(2) done in step 5 might block indefinitely.
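The five steps can be replayed in a few lines of C. The sketch below is not from the man page: `second_wait_result` is a hypothetical helper, and it uses a timeout of 0 in step 5 so that the edge-triggered case returns 0 instead of hanging.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Replays steps 1-5 on a fresh pipe. mode_flags is 0 for
 * level-triggered or EPOLLET for edge-triggered. Returns what the
 * second epoll_wait (step 5) reports, or -1 if an earlier step fails. */
int second_wait_result(uint32_t mode_flags)
{
    char buf[2048];
    int p[2], result = -1;
    struct epoll_event ev = {0}, out;
    int epfd = epoll_create1(0);

    if (epfd == -1)
        return -1;
    if (pipe(p) == -1) {
        close(epfd);
        return -1;
    }

    ev.events = EPOLLIN | mode_flags;
    ev.data.fd = p[0];
    memset(buf, 'a', sizeof buf);

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&       /* step 1 */
        write(p[1], buf, sizeof buf) == (ssize_t)sizeof buf &&  /* step 2: 2 kB */
        epoll_wait(epfd, &out, 1, 0) == 1 &&                    /* step 3 */
        read(p[0], buf, 1024) == 1024)                          /* step 4: 1 kB */
        result = epoll_wait(epfd, &out, 1, 0);                  /* step 5 */

    close(p[0]);
    close(p[1]);
    close(epfd);
    return result;
}
```

With mode_flags set to 0 the second wait still reports the descriptor, because 1 kB remains unread. With EPOLLET it reports nothing, which is the situation the man page warns about.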

An application that employs the EPOLLET flag should use non-blocking file descriptors to avoid having a blocking read or write starve a task that is handling multiple file descriptors. The suggested way to use epoll as an edge-triggered (EPOLLET) interface is as follows:

i with non-blocking file descriptors; and

ii by waiting for an event only after read(2) or write(2) return EAGAIN.

By contrast, when used as a level-triggered interface (the default, when EPOLLET is not specified), epoll is simply a faster poll(2), and can be used wherever the latter is used since it shares the same semantics.

Since even with edge-triggered epoll, multiple events can be generated upon receipt of multiple chunks of data, the caller has the option to specify the EPOLLONESHOT flag, to tell epoll to disable the associated file descriptor after the receipt of an event with epoll_wait(2). When the EPOLLONESHOT flag is specified, it is the caller's responsibility to rearm the file descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.
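The re-arm step for EPOLLONESHOT can be sketched as follows. `set_oneshot` and `oneshot_demo` are names invented for this example, and a pipe again stands in for a socket.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Registers (EPOLL_CTL_ADD) or re-arms (EPOLL_CTL_MOD) fd as one-shot. */
static int set_oneshot(int epfd, int op, int fd)
{
    struct epoll_event ev = {0};

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev);
}

/* Stores the results of three non-blocking waits in results[]:
 * after the first event, while disabled, and after re-arming.
 * Returns 0 if every step could be carried out. */
int oneshot_demo(int results[3])
{
    struct epoll_event out;
    int p[2], rc = -1;
    int epfd = epoll_create1(0);

    if (epfd == -1)
        return -1;
    if (pipe(p) == -1) {
        close(epfd);
        return -1;
    }

    if (set_oneshot(epfd, EPOLL_CTL_ADD, p[0]) == 0 &&
        write(p[1], "x", 1) == 1) {
        results[0] = epoll_wait(epfd, &out, 1, 0);  /* delivered once */
        results[1] = epoll_wait(epfd, &out, 1, 0);  /* disabled, byte still unread */
        if (set_oneshot(epfd, EPOLL_CTL_MOD, p[0]) == 0) {
            results[2] = epoll_wait(epfd, &out, 1, 0);  /* reported again */
            rc = 0;
        }
    }

    close(p[0]);
    close(p[1]);
    close(epfd);
    return rc;
}
```

The descriptor stays registered while disabled, which is why the re-arm uses EPOLL_CTL_MOD rather than EPOLL_CTL_ADD.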

Usage example:

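#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
int listen_sock, conn_sock, nfds, epollfd, n;
struct sockaddr_storage local;          /* peer address filled in by accept() */
socklen_t addrlen = sizeof(local);

/* Set up listening socket, 'listen_sock' (socket(),
   bind(), listen()) */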

epollfd = epoll_create(10);
if (epollfd == -1)
{
    perror("epoll_create");
    exit(EXIT_FAILURE);
}

ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1)
{
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}

for (;;)
{
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
    if (nfds == -1)
    {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }

    for (n = 0; n < nfds; ++n)
    {
        if (events[n].data.fd == listen_sock)
        {
            addrlen = sizeof(local);    /* reset: accept() overwrites it */
            conn_sock = accept(listen_sock,
                               (struct sockaddr *) &local, &addrlen);
            if (conn_sock == -1)
            {
                perror("accept");
                exit(EXIT_FAILURE);
            }
            setnonblocking(conn_sock);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = conn_sock;
            if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                          &ev) == -1)
            {
                perror("epoll_ctl: conn_sock");
                exit(EXIT_FAILURE);
            }
        }
        else
        {
            do_use_fd(events[n].data.fd);
        }
    }
}
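The example calls setnonblocking() without defining it. One common way to write such a helper uses fcntl(2); the body below is an assumption about what the helper is meant to do, not code from the man page.

```c
#include <fcntl.h>

/* Adds O_NONBLOCK to the file status flags of fd while keeping the
 * flags that are already set. Returns 0 on success, -1 on failure. */
int setnonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```

Reading the current flags first matters: passing only O_NONBLOCK to F_SETFL would silently clear flags such as O_APPEND.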
