Focused on distributed systems and infrastructure

Data Organization in InnoDB

Posted on September 16, 2013 by Matrix

From: https://blogs.oracle.com/mysqlinnodb/entry/data_organization_in_innodb

Introduction

This article explains how data is organized in the InnoDB storage engine. First we look at the files that InnoDB creates, then at the logical data organization: tablespaces, pages, segments and extents. We explore each of them in some detail and discuss how they relate to one another. By the end of this article, the reader will have a high-level view of the data layout within the InnoDB storage engine.

The Files

MySQL stores all data within the data directory. The data directory can be specified using the command line option --datadir or in the configuration file as datadir. Refer to the Server Command Options for complete details.

By default, when InnoDB is initialized, it creates 3 important files in the data directory – ibdata1, ib_logfile0 and ib_logfile1. The ibdata1 is the data file in which system and user data will be stored. The ib_logfile0 and ib_logfile1 are the redo log files. The location and size of these files are configurable. Refer to Configuring InnoDB for more details.

The data file ibdata1 belongs to the system tablespace, whose tablespace id (space_id) is 0. As of MySQL 5.6, only the system tablespace can contain more than one data file; every other tablespace contains exactly one. Likewise, only the system tablespace can contain more than one table, while every other tablespace contains only one table.

The data files and the redo log files are represented in memory by the C structure fil_node_t.

Tablespaces

By default, InnoDB contains only one tablespace called the system tablespace whose identifier is 0. More tablespaces can be created indirectly using the innodb_file_per_table configuration parameter. In MySQL 5.6, this configuration parameter is ON by default. When it is ON, each table will be created in its own tablespace in a separate data file.

The relationship between the tablespace and data files is explained in the InnoDB source code comment (storage/innobase/fil/fil0fil.cc) which is quoted here for reference:

“A tablespace consists of a chain of files. The size of the files does not have to be divisible by the database block size, because we may just leave the last incomplete block unused. When a new file is appended to the tablespace, the maximum size of the file is also specified. At the moment, we think that it is best to extend the file to its maximum size already at the creation of the file, because then we can avoid dynamically extending the file when more space is needed for the tablespace.”

The last statement about avoiding dynamic extension is applicable only to the redo log files and not the data files. Data files are dynamically extended, but redo log files are pre-allocated. Also, as already mentioned earlier, only the system tablespace can have more than one data file.

It is also clearly mentioned that even though the tablespace can have multiple files, they are thought of as one single large file concatenated together. So the order of files within the tablespace is important.

Pages

A data file is logically divided into equal sized pages. The first page of the first data file is identified with page number of 0, and the next page would be 1 and so on. A page within a tablespace is uniquely identified by the page identifier or page number (page_no). And each tablespace is uniquely identified by the tablespace identifier (space_id). So a page is uniquely identified throughout InnoDB by using the (space_id, page_no) combination. And any location within InnoDB can be uniquely identified by the (space_id, page_no, page_offset) combination, where page_offset is the number of bytes within the given page.

How the pages from different data files relate to one another is explained in another source code comment: “A block’s position in the tablespace is specified with a 32-bit unsigned integer. The files in the chain are thought to be catenated, and the block corresponding to an address n is the nth block in the catenated file (where the first block is named the 0th block, and the incomplete block fragments at the end of files are not taken into account). A tablespace can be extended by appending a new file at the end of the chain.” This means that not every data file starts at page_no 0 (zero): only the first page of the first data file in a tablespace has page_no 0.

Also in the above comment it is mentioned that the page_no is a 32-bit unsigned integer. This is the size of the page_no when stored on the disk.
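
The addressing scheme described above can be sketched in a few lines of C. This is only an illustration: page_addr_t and page_addr_to_byte_offset are hypothetical names that do not exist in the InnoDB source, and the sketch assumes that every page in the tablespace has the same size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration type; InnoDB does not define this struct. */
typedef struct {
    uint32_t space_id;    /* tablespace identifier */
    uint32_t page_no;     /* 32-bit page number within the tablespace */
    uint32_t page_offset; /* byte offset within the page */
} page_addr_t;

/* Byte position of an address inside the concatenated tablespace,
   assuming every page is page_size bytes long. */
uint64_t page_addr_to_byte_offset(page_addr_t addr, uint32_t page_size)
{
    assert(addr.page_offset < page_size);
    /* Widen before multiplying so large page numbers do not overflow. */
    return (uint64_t)addr.page_no * page_size + addr.page_offset;
}
```

With a 16K page size, byte 38 of page 3 sits at 3 * 16384 + 38 = 49190 bytes from the start of the first data file. The result has to be a 64-bit value, because a 32-bit page_no multiplied by the page size does not fit in 32 bits.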

Every page has a page header (page_header_t). For more details, refer to Jeremy Cole’s blog post “The basics of InnoDB space file layout.”

Extents

An extent is 1MB of consecutive pages. The size of one extent is defined as follows (1048576 bytes = 1MB):

#define FSP_EXTENT_SIZE (1048576U / UNIV_PAGE_SIZE)

The macro UNIV_PAGE_SIZE used to be a compile-time constant. From MySQL 5.6 onwards it is a global variable. The number of pages in an extent therefore depends on the page size in use. If the page size is 16K (the default), an extent contains 64 pages.
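
To make the arithmetic concrete, here is a small sketch of the same division as a function. The names EXTENT_BYTES, pages_per_extent and extent_of_page are made up for this example and are not InnoDB identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define EXTENT_BYTES 1048576U /* 1MB, the numerator of FSP_EXTENT_SIZE */

/* Number of pages in one extent for a given page size in bytes. */
uint32_t pages_per_extent(uint32_t page_size)
{
    assert(page_size > 0 && EXTENT_BYTES % page_size == 0);
    return EXTENT_BYTES / page_size;
}

/* Index of the extent that contains page_no. */
uint32_t extent_of_page(uint32_t page_no, uint32_t page_size)
{
    return page_no / pages_per_extent(page_size);
}
```

For the default 16K page this gives 64 pages per extent, so pages 0 to 63 fall in extent 0 and page 64 starts extent 1. A smaller page size such as 8K or 4K would give 128 or 256 pages per extent.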

Page Types

One page can be used for many purposes. The page type will identify the purpose for which the page is being used. The page type of each page will be stored in the page header. The page types are available in the header file storage/innobase/include/fil0fil.h. The following table provides a brief description of the page types.

Page Type	Description
FIL_PAGE_INDEX	The page is a B-tree node
FIL_PAGE_UNDO_LOG	The page stores undo logs
FIL_PAGE_INODE	The page contains an array of fseg_inode_t objects
FIL_PAGE_IBUF_FREE_LIST	The page is in the free list of insert buffer or change buffer.
FIL_PAGE_TYPE_ALLOCATED	Freshly allocated page.
FIL_PAGE_IBUF_BITMAP	Insert buffer or change buffer bitmap
FIL_PAGE_TYPE_SYS	System page
FIL_PAGE_TYPE_TRX_SYS	Transaction system data
FIL_PAGE_TYPE_FSP_HDR	File space header
FIL_PAGE_TYPE_XDES	Extent Descriptor Page
FIL_PAGE_TYPE_BLOB	Uncompressed BLOB page
FIL_PAGE_TYPE_ZBLOB	First compressed BLOB page
FIL_PAGE_TYPE_ZBLOB2	Subsequent compressed BLOB page

Each page type is used for a different purpose. Exploring each page type is beyond the scope of this article. For now, it is sufficient to know that all pages have a page header (page_header_t) that stores the page type, and that the page type decides the contents and layout of the page.

Tablespace Header

Each tablespace has a header of type fsp_header_t. This data structure is stored in the first page of the tablespace and contains the following information:

The table space identifier (space_id)
Current size of the table space in pages.
List of free extents
List of full extents not belonging to any segment.
List of partially full/free extents not belonging to any segment.
List of pages containing segment headers, where all the segment inode slots are reserved. (pages of type FIL_PAGE_INODE)
List of pages containing segment headers, where not all the segment inode slots are reserved. (pages of type FIL_PAGE_INODE).

Figure: InnoDB Tablespace Header Structure

From the tablespace header, we can access the list of segments available in the tablespace. The total space occupied by the tablespace header is given by the macro FSP_HEADER_SIZE, which is equal to 16*7 = 112 bytes.

Reserved Pages of Tablespace

As mentioned earlier, InnoDB always contains one tablespace called the system tablespace, with identifier 0. This is a special tablespace and is kept open for as long as the MySQL server is running. The first few pages of the system tablespace are reserved for internal use. This information can be obtained from the header storage/innobase/include/fsp0types.h. The pages are listed below with a short description.

Page Number	The Page Name	Description
0	FSP_XDES_OFFSET	The extent descriptor page.
1	FSP_IBUF_BITMAP_OFFSET	The insert buffer bitmap page.
2	FSP_FIRST_INODE_PAGE_NO	The first inode page number.
3	FSP_IBUF_HEADER_PAGE_NO	Insert buffer header page in system tablespace.
4	FSP_IBUF_TREE_ROOT_PAGE_NO	Insert buffer B-tree root page in system tablespace.
5	FSP_TRX_SYS_PAGE_NO	Transaction system header in system tablespace.
6	FSP_FIRST_RSEG_PAGE_NO	First rollback segment page, in system tablespace.
7	FSP_DICT_HDR_PAGE_NO	Data dictionary header page in system tablespace.

As the table shows, the first 3 pages are present in every tablespace, while the last 5 are reserved only in the system tablespace. All other tablespaces reserve only those first 3 pages.

When the option innodb_file_per_table is enabled, then for each table a separate tablespace with one data file would be created. The source code comment in the function dict_build_table_def_step() states the following:

                /* We create a new single-table tablespace for the table. 
                We initially let it be 4 pages: 
                - page 0 is the fsp header and an extent descriptor page, 
                - page 1 is an ibuf bitmap page, 
                - page 2 is the first inode page, 
                - page 3 will contain the root of the clustered index of the 
                table we create here. */

File Segments

A tablespace can contain many file segments. A file segment (or just segment) is a logical entity. Each segment has a segment header (fseg_header_t), which points to the inode (fseg_inode_t) describing the file segment. The file segment header contains the following information:

The space to which the inode belongs
The page_no of the inode
The byte offset of the inode
The length of the file segment header (in bytes).

Note: It would have been really more readable (at source code level) if fseg_header_t and fseg_inode_t had proper C-style structures defined for them.

The fseg_inode_t object contains the following information:

The segment id to which it belongs.
List of full extents.
List of free extents of this segment.
List of partially full/free extents
Array of individual pages belonging to this segment. The size of this array is half an extent.

When a segment wants to grow, it will get free extents or pages from the tablespace to which it belongs.

Table

In InnoDB, when a table is created, a clustered index (B-tree) is created internally. This B-tree contains two file segments, one for the non-leaf pages and the other for the leaf pages. From the source code documentation:

“In the root node of a B-tree there are two file segment headers. The leaf pages of a tree are allocated from one file segment, to make them consecutive on disk if possible. From the other file segment we allocate pages for the non-leaf levels of the tree.”

For a given table, the root page of a B-tree will be obtained from the data dictionary. So in InnoDB, each table exists within a tablespace, and contains one B-tree (the clustered index), which contains 2 file segments. Each file segment can contain many extents, and each extent contains 1MB of consecutive pages.

Conclusion

This article discussed how data is organized within InnoDB. We first looked at the files created by InnoDB, and then discussed the various logical entities: tablespaces, pages, page types, extents, segments and tables. We also looked at how they relate to one another.

Posted in MySQL | Comments closed

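Distributed Systems Principles: Consistency & Durability

Posted on September 12, 2013 by Matrix

Reposted from: http://goleo8.iteye.com/blog/662108

1. What are consistency, durability, and transactions

Once an atomic action also provides consistency, isolation and durability, it can be called a transaction.
Consistency is an application-defined requirement that every update to a collection of data must preserve some specified invariant. Different applications can have quite different consistency invariants.
We keep talking about consistency, and we tend to take it to mean that the data we see agrees with the data written by a series of operations. But that is not the most fundamental meaning of consistency. Fundamentally it means that, according to the application's requirements, our operations on a collection of data must preserve an invariant. For example, the row numbers of a table should correspond to its row count, and a cache should correspond to the backing data.

The book devotes a long passage to atomicity. Atomicity is divided into all-or-nothing atomicity and isolation atomicity.

Strong consistency: inconsistency is hidden inside the system boundary. Seen from outside, the system is consistent at all times.
Eventual consistency: while data is being updated, there is a period during which the system looks inconsistent from outside its boundary, but after some time consistency is guaranteed.

Sometimes eventual consistency is actually an advantage. For example, when a web page is downloaded, the text appears first and the images later. During the download the page is inconsistent with the backend data, yet this improves the user experience.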

2. Cache coherence

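The coherence requirement for a cache is that the data stored in the cache should equal the data in secondary storage. But because of the delay between the cache and secondary storage, there is a period during which the two differ.

A cache should satisfy read/write coherence: The result of a read of a named object is always the value of the most recent write to that object.
In other words, a read of an object should return the result of the most recent write.
Caches are divided into:
1. Write-through cache: every write updates not only the cache but also secondary storage, which easily becomes a performance bottleneck.
2. Write-back cache: a write first goes to the cache, at which point the application can consider the write complete. Propagating the cached data to secondary storage is the job of the cache manager.
With only one cache, a write-back cache can also provide strong consistency. But if a thread can read directly from secondary storage, or if there are several caches and one of them may not hold the latest data, consistency is broken.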
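
The difference between the two write policies can be shown with a toy model. This is a sketch under simplifying assumptions: a single cache in front of an in-memory array that stands in for secondary storage, no eviction, and made-up names (cache_t, cache_write, cache_read, cache_flush).

```c
#include <assert.h>
#include <stdbool.h>

#define NSLOTS 8

/* Hypothetical toy model: `storage` stands in for secondary storage. */
typedef struct {
    int  storage[NSLOTS]; /* authoritative copy */
    int  cached[NSLOTS];  /* cached values */
    bool valid[NSLOTS];   /* slot holds a cached value */
    bool dirty[NSLOTS];   /* cached value not yet written to storage */
    bool write_back;      /* false: write-through, true: write-back */
} cache_t;

void cache_write(cache_t *c, int slot, int value)
{
    assert(slot >= 0 && slot < NSLOTS);
    c->cached[slot] = value;
    c->valid[slot] = true;
    if (c->write_back) {
        c->dirty[slot] = true;     /* storage is updated later */
    } else {
        c->storage[slot] = value;  /* storage is updated on every write */
    }
}

int cache_read(cache_t *c, int slot)
{
    assert(slot >= 0 && slot < NSLOTS);
    if (!c->valid[slot]) {         /* miss: load from storage */
        c->cached[slot] = c->storage[slot];
        c->valid[slot] = true;
    }
    return c->cached[slot];
}

/* The cache manager's job in a write-back cache. */
void cache_flush(cache_t *c)
{
    for (int i = 0; i < NSLOTS; i++) {
        if (c->dirty[i]) {
            c->storage[i] = c->cached[i];
            c->dirty[i] = false;
        }
    }
}
```

In write-back mode, a reader that goes through cache_read sees the new value immediately, while anything that looks at storage directly keeps seeing the old value until cache_flush runs. That window is exactly where a second cache or a direct reader would observe stale data.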

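How can coherence still be obtained with distributed caches?
1. If there is little data that is both shared and writable, mark that data as "non-cacheable".
a) The World Wide Web takes this approach. An HTTP header field can mark a response as non-cacheable, so that the web page is not cached.
b) A similar idea in the Java memory model is to declare a variable volatile.
2. Another idea is to mark cached copies that disagree with the authoritative copy as invalid.
a) One design: if the processors share a single cache in front of the shared secondary storage, inconsistency is avoided. But a shared cache leads to contention and mutual exclusion among the processors, which lowers system performance. So each processor is given its own private cache, and that creates inconsistency. Even with a write-through cache, processor A's update to Data never reaches processor B's private cache. A mechanism is therefore needed to tell the processors that use Data that their copy is no longer valid.
i. When one processor writes, tell all other processors that their entire private caches are invalid.
Another approach uses more wires to tell the other private caches which memory address has become invalid. One way is for every private cache to listen on the memory bus and, when a write appears on the bus, invalidate its own copy of that address. A slightly more clever design will also grab the data value from the bus as it goes by and update, rather than invalidate, its copy of that data. These are two variations on what is called the snoopy cache: each cache is snooping on bus activity.
ii. Even a snoopy cache still runs into problems. The cache problem is solved, but registers introduce a new synchronization problem.
3. As long as replicas may be accessed concurrently by multiple requests, maintaining isolation and consistency is a complex problem. Using locks to prevent concurrent access is one solution, but it hurts performance. A feasible alternative is to use probability.
3. Durability, and the consistency problems that multiple replicas bring

Durability brings consistency problems among multiple replicas. Everything below deals with inconsistency among replicas.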
The durability mantra
Multiple copies, widely separated and independently administered…
Multiple copies, widely separated and independently administered…

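1. Replicated state machine
If the data is written exactly once and never again changed, the management plan can be fairly straightforward. This is the precondition that lets Hadoop and Cassandra keep multiple replicas consistent.
2. Master and Slave structure
A major weakness of the M/S structure is that while M is being updated, reads from S return stale data. Designing an M/S structure raises a series of problems, such as momentary inconsistency between M and S. Another is that if M's data has decayed and S has not yet synchronized, the data S synchronizes will also be wrong.
3. How to ensure the integrity of distributed data?
a) Replicas can be checked against one another, but if transferring the data is expensive, this costs a great deal of time.
b) Instead of comparing the data directly, transfer an MD5 checksum.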
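
The idea of comparing digests instead of data can be sketched as follows. To stay free of external libraries, this sketch swaps MD5 for the 64-bit FNV-1a hash, which is not cryptographic and is used here purely as a stand-in. The function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, 64-bit. Not cryptographic; it stands in for MD5 here. */
uint64_t fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
    assert(p != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;            /* FNV prime */
    }
    return h;
}

/* Compares two replicas by length and digest. In a real system each
   side would hash its own copy and only the digest would be sent. */
bool replicas_match(const void *a, size_t a_len,
                    const void *b, size_t b_len)
{
    return a_len == b_len && fnv1a64(a, a_len) == fnv1a64(b, b_len);
}
```

Only 8 bytes per replica need to cross the network in this scheme, however large the data is. Equal digests do not strictly prove equal data, which is why a stronger hash would be chosen in practice.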
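Summary:
1) Simple replicas (RAID)
2) More widely distributed replicas, to survive failures such as earthquakes (GFS)
3) Writing data according to some logic, that is, everyone follows certain rules when writing, so that inconsistency is avoided
4) Using probability to improve performance
5) Replicated state machines
6) Master/Slave structure
-> To avoid inconsistency between M and S, partition M's table into small pieces and give each piece its own master
-> To avoid inconsistency between M and S, use the two-phase commit protocol
-> When M fails, use an election algorithm to choose a new master
-> If the application is one in which the data is insensitive to the order of updates, implement a replicated state machine without a consensus algorithm.
-> Incremental updates
-> Shipping the log of operations rather than the increments
7) Quorum algorithms
4. Reconciliation algorithms

What is reconciliation? When the system is halfway through an update, or the M in an M-S structure has crashed, or the replicas have become inconsistent, bringing the system from this "disharmonious" state back to a "harmonious" one is reconciliation.
1. Occasionally connected operation
An example of this scenario is data synchronization between an iPhone and an iMac. A better analogy is SVN: synchronizing files between a client and the SVN server.
How to detect differences between files:
a) Checksums.
b) Maintain a uniform file id and increment it by 1 whenever the file changes.
c) Timestamps. When a file is modified, its timestamp is updated.
5. Atomicity across layers and multiple sites

The two phases of the two-phase commit protocol (take care to distinguish the two-phase commit protocol, 2PC, from the two-phase locking protocol, 2PL):
Agreement phase:
In this phase the coordinator sends a "try to commit" request to every node that is to perform the commit. On receiving the request, a node "tries to commit", which includes, for example, updating its logs: adding new entries to the undo log and to the redo log. If the node finds that it can commit normally, it returns a Yes message; otherwise it returns a No message, meaning abort.
If the coordinator receives Yes from every node, it sends a commit message; all nodes then commit and release the resource locks they hold.
If any of the messages the coordinator receives is No, meaning that some node cannot commit, the coordinator broadcasts a rollback message. Each node then rolls back according to its undo log and releases the resource locks it holds. This is exactly where the memento pattern comes in handy!
Drawbacks:
This is a blocking protocol, which has a severe impact on system availability. If the coordinator fails, the locks held by some nodes may never be released. Suppose a node has sent its agreement message to the coordinator and is waiting for the commit or rollback reply. If the coordinator crashes at that moment, the node stays locked forever, unless it can obtain the outcome from another coordinator.
When the coordinator sends a "Query-to-commit" message, it is also blocked until it has received all the responses. But if a node does not respond, the coordinator is not blocked forever, because it can use a timeout.
Because of this timeout mechanism, the protocol has another major weakness: it is biased toward aborting a case rather than completing it.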
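
The coordinator's decision rule can be written down in a few lines. This sketch covers only the decision itself, not messaging, logging or recovery, and the type and function names are hypothetical. A timeout is modeled as a third kind of reply, which shows why the protocol leans toward abort.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { VOTE_YES, VOTE_NO, VOTE_TIMEOUT } vote_t;
typedef enum { DECIDE_COMMIT, DECIDE_ROLLBACK } decision_t;

/* Second-phase decision of the coordinator, given the replies that
   the first phase collected from the n participants. */
decision_t coordinator_decide(const vote_t *votes, size_t n)
{
    assert(votes != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        if (votes[i] != VOTE_YES) {
            /* A single No, or a single missing reply, aborts everyone. */
            return DECIDE_ROLLBACK;
        }
    }
    return DECIDE_COMMIT;
}
```

Commit requires a Yes from every participant, so one slow node is enough to roll back a transaction that every other node was ready to commit.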
Implementing the two-phase commit protocol

Common architecture

In many cases the 2PC protocol is distributed in a computer network. It is easily distributed by implementing multiple dedicated 2PC components similar to each other, typically named Transaction managers (TMs; also referred to as 2PC agents), that carry out the protocol’s execution for each transaction (e.g., The Open Group’s X/Open XA). The databases involved with a distributed transaction, the participants, both the coordinator and cohorts, register to close TMs (typically residing on respective same network nodes as the participants) for terminating that transaction using 2PC. Each distributed transaction has an ad hoc set of TMs, the TMs to which the transaction participants register. A leader, the coordinator TM, exists for each transaction to coordinate 2PC for it, typically the TM of the coordinator database. However, the coordinator role can be transferred to another TM for performance or reliability reasons. Rather than exchanging 2PC messages among themselves, the participants exchange the messages with their respective TMs. The relevant TMs communicate among themselves to execute the 2PC protocol schema above, “representing” the respective participants, for terminating that transaction. With this architecture the protocol is fully distributed (does not need any central processing component or data structure), and scales up with number of network nodes (network size) effectively.
This common architecture is also effective for the distribution of other atomic commitment protocols besides 2PC, since all such protocols use the same voting mechanism and outcome propagation to protocol participants.[1] [2]
Protocol optimizations

Database research has been done on ways to get most of the benefits of the two-phase commit protocol while reducing costs by protocol optimizations [1] [2] and protocol operations saving under certain system’s behavior assumptions.
Presumed abort and Presumed commit

Presumed abort or Presumed commit are common such optimizations.[3][2] An assumption about the outcome of transactions, either commit, or abort, can save both messages and logging operations by the participants during the 2PC protocol’s execution. For example, when presumed abort, if during system recovery from failure no logged evidence for commit of some transaction is found by the recovery procedure, then it assumes that the transaction has been aborted, and acts accordingly. This means that it does not matter if aborts are logged at all, and such logging can be saved under this assumption. Typically a penalty of additional operations is paid during recovery from failure, depending on optimization type. Thus the best variant of optimization, if any, is chosen according to failure and transaction outcome statistics.
Tree two-phase commit protocol

The Tree 2PC protocol [2] (also called Nested 2PC, or Recursive 2PC) is a common variant of 2PC in a network, which better utilizes the underlying communication infrastructure. In this variant the coordinator is the root (“top”) of a communication tree (inverted tree), while the cohorts are the other nodes. Messages from the coordinator are propagated “down” the tree, while messages to the coordinator are “collected” by a cohort from all the cohorts below it, before it sends the appropriate message “up” the tree (except an abort message, which is propagated “up” immediately upon receiving it, or if this cohort decided to abort).
The Dynamic two-phase commit (Dynamic two-phase commitment, D2PC) protocol[4][2] is a variant of Tree 2PC with no predetermined coordinator. Agreement messages start to propagate from all the leaves, each leaf when completing its tasks on behalf of the transaction (becoming ready), and the coordinator is determined dynamically by racing agreement messages, at the place where they collide. They collide either on a transaction tree node, or on an edge. In the latter case one of the two edge’s nodes is elected as a coordinator (any node). D2PC is time optimal (among all the instances of a specific transaction tree, and any specific Tree 2PC protocol implementation; all instances have the same tree; each instance has a different node as coordinator): it commits the coordinator and each cohort in minimum possible time, allowing earlier release of locked resources.

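6. Telling consistency models apart

Data consistency usually refers to whether the logical relationships among related data are correct and complete. The consistency model of a data store can be seen as a contract between the storage system and the users of the data: if users follow the contract, they get the access results that the system promises.
Commonly used consistency models are:
Strict consistency (strict/atomic consistency, often equated with linearizability): a read always returns the most recently written data. In its absolute form this requires a global clock, and so it cannot be achieved in a distributed network environment.
Weak consistency: sequential consistency is guaranteed only for accesses to shared data structures. Operations on synchronization variables are sequentially consistent and globally visible, and may be performed only when no writes are pending, so that accesses to critical regions happen in order. At a synchronization point all users see the same data.
Eventual consistency: when there are no new updates, the updates eventually propagate through the network to all replicas, and all replicas eventually agree. In other words, before that eventual point in time, a user is not guaranteed to see newly written data. A key requirement for adopting the eventual consistency model is that reading stale data is acceptable.
Sequential consistency: all users see the operations on the same data in the same order, but that order is not necessarily the real-time order.
Causal consistency: only writes that are causally related must be seen by all users in the same order. Writes without a causal relationship proceed in parallel, with no ordering guarantee. Causal consistency can be seen as a performance optimization of sequential consistency, but an implementation has to build and maintain a causal dependency graph, which is quite difficult.
Pipelined consistency (PRAM/FIFO consistency): a further weakening of causal consistency. The writes performed by any one user are seen by all other users in the order they were issued, while writes from different users need no ordering guarantee, as if each user had its own pipe. This is comparatively easy to implement.
Release consistency: weak consistency cannot tell whether a user is entering or leaving a critical region; release consistency distinguishes the two with separate operations. A user that needs to write acquires the object and releases it when done. The acquire and release form a critical region, and providing release consistency means that once the release has happened, all users should be able to see the write.
Delta consistency: the system becomes consistent within a time delta. During that time there is a window of inconsistency, which may be caused by log shipping.

Several concrete forms of eventual consistency:
1. Read-your-writes consistency: the data a user reads is never older than that user's own last write.
2. Session consistency: weaker than read-your-writes consistency. Read-your-writes is guaranteed only within one session; once a new session starts, it no longer needs to hold.
3. Monotonic read consistency: the data read is never older than the data read previously.
4. Monotonic write consistency: a write must complete before the next write begins.
5. Writes-follow-reads consistency: a write is applied to a replica no older than the data last read, so data is never written on top of an older version.
6. Election consistency:
Werner Vogels holds that in many Internet applications, monotonic reads plus read-your-writes provide enough consistency.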

Posted in Distributed Systems Theory | Comments closed

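Notes on using pipes under Linux

Posted on September 12, 2013 by Matrix

Reposted from: http://blog.yufeng.info/archives/1485

Pipes are used very widely on Linux; the shell itself relies heavily on pipes to glue producers and consumers together. Our server programs often use a pipe for IPC between threads. Everything on Unix is a file, and reading a file sets its last access time, so pipes are no exception. This timestamp is meaningless to us, however, and when a pipe is used very heavily, setting the access time causes a performance problem: the overhead is far larger than that of the pipe read or write itself. Compared with the cost of reading and writing a regular file, atime is negligible, but for a pipe it is a different story.
I discovered this by accident while playing with Duolong's network framework together with him.

Let's analyze the relevant pipe code: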

//pipe.c:L349
static ssize_t
pipe_read(struct kiocb *iocb, const struct iovec *_iov,
               unsigned long nr_segs, loff_t pos)
{
...
   if (ret > 0)
        file_accessed(filp);
    return ret;
}

We can see that a pipe read updates the access time through file_accessed. Next:

//fs.h:L1761
extern void touch_atime(struct vfsmount *mnt, struct dentry *dentry);
static inline void file_accessed(struct file *file)
{
        if (!(file->f_flags & O_NOATIME))
                touch_atime(file->f_path.mnt, file->f_path.dentry);
}

If the file does not have O_NOATIME set, the kernel really goes ahead and sets atime. Next:

//inode.c:L1493
void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
{
        struct inode *inode = dentry->d_inode;
        struct timespec now;

        if (inode->i_flags & S_NOATIME)
                return;
        if (IS_NOATIME(inode))
                return;
        if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
                return;

        if (mnt->mnt_flags & MNT_NOATIME)
                return;
        if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
                return;

        now = current_fs_time(inode->i_sb);

        if (!relatime_need_update(mnt, inode, now))
                return;

        if (timespec_equal(&inode->i_atime, &now))
                return;

        if (mnt_want_write(mnt))
                return;

        inode->i_atime = now;
        mark_inode_dirty_sync(inode);
        mnt_drop_write(mnt);
}

We can see that the flow above is fairly complex, and its overhead is considerable.
Let's demonstrate:

$ cat > pipe_test.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <linux/unistd.h>

static int fds[2];
static pthread_t rp;

static void *rp_entry(void *arg) {
  char c[1];
  while (1 == read(fds[0], c, 1)) {
    if (*c == 'Q') break;
  }
  fprintf(stderr, "pipe read ok\n");
  return NULL;
}

int main(int argc, char *argv[]) {
  long i, n;
  int rc;
  if (argc < 2) {
    fprintf(stderr, "usage: pipe_test NNNNNN\n");
    return -1;
  }
  n = atol(argv[1]);
  pipe(fds);
  //fcntl(fds[0], F_SETFL, O_NOATIME);
  pthread_create(&rp, NULL, rp_entry, NULL);
  fprintf(stderr, "pipe write %ld...", n);
  for (i = 0; i < n; i++) {
    write(fds[1], "A", 1);
  }
  write(fds[1], "Q", 1);
  fprintf(stderr, "ok\n");
  pthread_join(rp, NULL);
  close(fds[0]);
  close(fds[1]);
  return 0;
}
CTRL+D
$ gcc -D_GNU_SOURCE pipe_test.c -lpthread
$ sudo opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/2.6.18-164.el5/vmlinux
$ sudo opcontrol --init && sudo opcontrol --reset && sudo opcontrol --start
$ ./a.out 10000000
pipe write 10000000...ok
pipe read ok
$ sudo opcontrol --shutdown
$ opreport -l|less            
samples  %        app name                 symbol name
378654   92.7742  vmlinux                  .text.acpi_processor_idle
12978     3.1797  vmlinux                  current_fs_time
2530      0.6199  vmlinux                  thread_return
2345      0.5745  vmlinux                  touch_atime
2253      0.5520  vmlinux                  .text.acpi_safe_halt
1597      0.3913  vmlinux                  timespec_trunc
1368      0.3352  vmlinux                  file_update_time
1253      0.3070  vmlinux                  __mark_inode_dirty
901       0.2208  vmlinux                  pipe_writev
768       0.1882  vmlinux                  __mutex_lock_slowpath
763       0.1869  vmlinux                  try_to_wake_up
270       0.0662  vmlinux                  copy_user_generic_unrolled
254       0.0622  vmlinux                  acpi_set_register
254       0.0622  vmlinux                  system_call
233       0.0571  vmlinux                  pipe_readv
188       0.0461  vmlinux                  dnotify_parent
167       0.0409  vmlinux                  mutex_unlock
...

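We can see that touch_atime is expensive, far more so than the pipe read and write themselves.
This time, uncomment the line fcntl(fds[0], F_SETFL, O_NOATIME); which tells the pipe not to update atime on reads, and look at the effect: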

$ opreport -l|less
samples  %        app name                 symbol name
599018   95.2466  vmlinux                  .text.acpi_processor_idle
4140      0.6583  vmlinux                  .text.acpi_safe_halt
3281      0.5217  vmlinux                  thread_return
2812      0.4471  vmlinux                  current_fs_time
2615      0.4158  vmlinux                  file_update_time
1790      0.2846  vmlinux                  __mutex_lock_slowpath
1657      0.2635  vmlinux                  timespec_trunc
1341      0.2132  vmlinux                  try_to_wake_up
1281      0.2037  vmlinux                  mutex_unlock
1080      0.1717  vmlinux                  mutex_lock
1001      0.1592  vmlinux                  pipe_readv
925       0.1471  vmlinux                  pipe_writev

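Now touch_atime no longer shows up. The overhead is gone, which matters a lot for high-performance servers.
Summary: details matter. Remember to set O_NOATIME when opening a file with open, or take care of it with fcntl.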
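
One detail about the fcntl route: F_SETFL replaces the descriptor's settable status flags rather than adding to them, so the one-line call in the test program would also clear a flag such as O_NONBLOCK if it had been set earlier. A slightly more careful sketch reads the current flags first. The helper names add_status_flags and pipe_with_read_flags are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L /* for pipe() and fcntl() under strict -std */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Add flags to a descriptor's file status flags without dropping the
   flags that are already set. Returns 0 on success, -1 on error. */
int add_status_flags(int fd, int extra)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | extra) == -1 ? -1 : 0;
}

/* Create a pipe and add read_flags to its read end.
   Returns 0 on success, -1 on error (no descriptors stay open). */
int pipe_with_read_flags(int fds[2], int read_flags)
{
    assert(fds != NULL);
    if (pipe(fds) == -1) {
        return -1;
    }
    if (add_status_flags(fds[0], read_flags) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return 0;
}
```

In the test program, with _GNU_SOURCE defined as in the gcc command above so that O_NOATIME is visible, the pipe(fds) call and the commented-out fcntl line could be replaced by pipe_with_read_flags(fds, O_NOATIME).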
Have fun!

Posted in Kernel | Comments closed

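The Linux signal handling mechanism

Posted on September 7, 2013 by Matrix

Reposted from: http://www.cnblogs.com/taobataoma/archive/2007/08/30/875743.html

Signals are a very important part of Linux programming. This article describes in detail the basic concepts of the signal mechanism, roughly how Linux implements it, how to use signals, and several signal-related system calls.

The signal mechanism is a way for processes to pass messages to one another. The full name is software interrupt signal, and some people simply call it a software interrupt. As the name suggests, its nature and use closely resemble interrupts, so signals can be regarded as part of process control.

I. Basic concepts of signals

This section first introduces some basic concepts, and then lists the basic signal types and the events they correspond to. The basic concepts are essential both for using signals and for understanding the signal mechanism. Let's start with what a signal is.

1. Basic concepts

A software interrupt signal (signal for short) notifies a process that an asynchronous event has occurred. Processes can send signals to one another with the kill system call. The kernel can also send a signal to a process because of an internal event, to notify it that something has happened. Note that a signal only tells a process that an event occurred; it does not carry any data to the process.

A process that receives a signal can handle it in different ways, which fall into three categories. The first resembles an interrupt handler: for a signal it wants to handle, the process specifies a handler function, and that function handles the signal. The second is to ignore the signal and do nothing, as if it had never occurred. The third is to keep the system default action for the signal; for most signals the default is to terminate the process. A process uses the signal system call to specify how it handles a given signal.

The process table entry contains a software interrupt signal field in which each bit corresponds to one signal. When a signal is sent to the process, the corresponding bit is set. It follows that a process can hold different signals pending at the same time, but for any one signal it cannot tell how many instances arrived before it was handled.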
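
The one-bit-per-signal field explains why repeated signals are not counted. The following toy model illustrates that behaviour only; it is not kernel code, and sigfield_t, post_signal, is_pending and take_signal are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the per-process signal field: one bit per signal number. */
typedef struct {
    uint64_t pending;
} sigfield_t;

/* "Send" a signal: set its bit. Sending it again changes nothing. */
void post_signal(sigfield_t *f, int signo)
{
    assert(signo >= 1 && signo <= 64);
    f->pending |= (uint64_t)1 << (signo - 1);
}

bool is_pending(const sigfield_t *f, int signo)
{
    assert(signo >= 1 && signo <= 64);
    return (f->pending >> (signo - 1)) & 1;
}

/* Return the lowest-numbered pending signal and clear its bit,
   or 0 if no signal is pending. */
int take_signal(sigfield_t *f)
{
    for (int signo = 1; signo <= 64; signo++) {
        if (is_pending(f, signo)) {
            f->pending &= ~((uint64_t)1 << (signo - 1));
            return signo;
        }
    }
    return 0;
}
```

Posting signal 2 three times and signal 15 once leaves exactly two bits set, so the process later sees one instance of each. The other two instances of signal 2 are simply lost.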

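2. Signal types

Signals are sent for many reasons. Here they are classified simply by cause, to give an overview:

(1) Signals related to process termination. These are sent when a process exits or a child process terminates.
(2) Signals related to process exceptions, such as an out-of-bounds access, an attempt to write to a read-only memory region (such as the program text segment), execution of a privileged instruction, and various other hardware errors.
(3) Signals related to unrecoverable conditions during a system call, for example when exec has already released the old resources and system resources are now exhausted.
(4) Signals related to unexpected error conditions while executing a system call, such as invoking a system call that does not exist.
(5) Signals sent by processes in user mode, for example when a process calls kill to send a signal to another process.
(6) Signals related to terminal interaction, such as the user closing a terminal or pressing the break key.
(7) Signals for tracing process execution.

The signals supported by Linux are listed below. Many signals are architecture-dependent. The signals listed in POSIX.1 come first: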

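Signal	Value	Action	Cause
SIGHUP	1	A	Hangup of the terminal, or death of the controlling process
SIGINT	2	A	Interrupt from the keyboard (for example, the break key is pressed)
SIGQUIT	3	C	Quit key pressed on the keyboard
SIGILL	4	C	Illegal instruction
SIGABRT	6	C	Abort signal from abort(3)
SIGFPE	8	C	Floating point exception
SIGKILL	9	AEF	Kill signal
SIGSEGV	11	C	Invalid memory reference
SIGPIPE	13	A	Broken pipe: write to a pipe with no readers
SIGALRM	14	A	Timer signal from alarm(2)
SIGTERM	15	A	Termination signal
SIGUSR1	30,10,16	A	User-defined signal 1
SIGUSR2	31,12,17	A	User-defined signal 2
SIGCHLD	20,17,18	B	Child process terminated
SIGCONT	19,18,25		Continue the process (if it was stopped)
SIGSTOP	17,19,23	DEF	Stop the process
SIGTSTP	18,20,24	D	Stop key pressed on the controlling terminal (tty)
SIGTTIN	21,21,26	D	Background process attempts to read from the controlling terminal
SIGTTOU	22,22,27	D	Background process attempts to write to the controlling terminal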

The following signals are not listed in POSIX.1 but are listed in SUSv2:

Signal	Value	Action	Cause
SIGBUS	10,7,10	C	Bus error (bad memory access)
SIGPOLL		A	Pollable event defined by Sys V; synonym for SIGIO
SIGPROF	27,27,29	A	Profiling timer expired
SIGSYS	12,-,12	C	Invalid system call (SVID)
SIGTRAP	5	C	Trace/breakpoint trap
SIGURG	16,23,21	B	Urgent condition on a socket (4.2 BSD)
SIGVTALRM	26,26,28	A	Virtual alarm clock (4.2 BSD)
SIGXCPU	24,24,30	C	CPU time limit exceeded (4.2 BSD)
SIGXFSZ	25,25,31	C	File size limit exceeded (4.2 BSD)

(For SIGSYS, SIGXCPU, SIGXFSZ, and on some architectures also SIGBUS, the Linux default action is A (terminate), whereas SUSv2 specifies C (terminate and dump core).)

Some other signals:

Signal	Value	Action	Cause
SIGIOT	6	C	IOT trap; synonym for SIGABRT
SIGEMT	7,-,7
SIGSTKFLT	-,16,-	A	Stack fault on the coprocessor
SIGIO	23,29,22	A	I/O is now possible (4.2 BSD)
SIGCLD	-,-,18	A	Synonym for SIGCHLD
SIGPWR	29,30,19	A	Power failure (System V)
SIGINFO	29,-,-	A	Synonym for SIGPWR
SIGLOST	-,-,-	A	File lock lost
SIGWINCH	28,28,20	B	Window size changed (4.3 BSD, Sun)
SIGUNUSED	-,31,-	A	Unused signal (will be SIGSYS)

(Here, - means that the signal is not implemented. Where three values are given, the first is usually valid for Alpha and Sparc, the middle one for i386, ppc and sh, and the last one for mips. Signal 29 is SIGINFO / SIGPWR on Alpha and SIGLOST on Sparc.)

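The letters in the Action column have the following meanings:
A	The default action is to terminate the process
B	The default action is to ignore the signal
C	The default action is to terminate the process and dump core
D	The default action is to stop the process
E	The signal cannot be caught
F	The signal cannot be ignored

The signals described above are those supported by common systems. The tables give each signal's name, purpose and default action. The default actions mean the following. Terminate: the process exits. Ignore: the signal is discarded without any processing. Stop: the program is suspended; after entering the stopped state it can still be resumed, which typically happens during debugging (for example with the ptrace system call). Core dump: the image of the process's data in memory, together with part of the process state kept in kernel structures, is written to the file system in a certain format, and the process exits. This is convenient for programmers: they can obtain the data values from the moment the process was running, determine the cause of the dump, and debug their program.

Note that SIGKILL and SIGSTOP can be neither caught nor ignored. SIGIOT and SIGABRT are the same signal. As the tables show, the same signal may have different values on different systems, so it is best to use the names defined for the signals rather than their numeric values.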

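II. The signal mechanism

The previous section introduced the basic concepts of signals. This section describes how the kernel implements the signal mechanism: how the kernel sends a signal to a process, how a process receives a signal, how a process controls its reaction to signals, and when and how the kernel handles the signals a process has received. It also describes the role that setjmp and longjmp play in signal handling.

1. How the kernel handles signals

The kernel sends a software interrupt signal to a process by setting the bit for that signal in the signal field of the process's table entry. It should be added that if the signal is sent to a sleeping process, what happens depends on the priority at which the process went to sleep. If the process sleeps at an interruptible priority, it is woken up; otherwise only the bit in the signal field is set, and the process is not woken. This matters because a process checks for received signals at two moments: when it is about to return from kernel mode to user mode, and when it is about to enter or leave a sleep at a suitably low scheduling priority.

The kernel handles a signal received by a process when that process returns from kernel mode to user mode. So while a process is running in kernel mode, a signal does not take effect immediately; it is handled only when the process is about to return to user mode. A process returns to user mode only after its signals have been handled, so in user mode it never has unhandled signals.

The kernel handles a process's signals in the context of that process, so the process must be in the running state. As mentioned earlier, there are three kinds of handling: the process exits on receiving the signal; the process ignores the signal; or the process executes the function that the user registered with the signal system call. When a process receives a signal that it ignores, it discards the signal and carries on as if the signal had never arrived. If the process receives a signal that it wants to catch, it executes the user-defined function when it returns from kernel mode to user mode. The way this is done is clever: the kernel creates a new frame on the user stack in which the return address is set to the address of the user-defined handler. When the process returns from the kernel and pops the top of the stack, it returns into the handler; when the handler returns and the stack is popped again, execution resumes at the point where the process originally entered the kernel. The reason for this is that a user-defined handler cannot and must not run in kernel mode (if it did, the user could obtain arbitrary privileges).

Several points about signal handling deserve special attention. First, on some systems, after a process has handled a signal and before it returns to user mode, the kernel clears the address of the handler registered in the user area. The next occurrence of that signal then gets the default action, unless signal is called again before it arrives. This can cause the process to receive the signal again before it has called signal, and so to exit. On BSD the kernel no longer clears the address. But not clearing it can cause a stack overflow if the process receives a signal too often and too quickly. To avoid this, BSD kernels imitate the handling of hardware interrupts: while one signal is being handled, further signals of the same kind are blocked.

Second, if a signal to be caught arrives while the process is in a system call and sleeping at an interruptible priority, the signal causes the process to do a longjmp out of the sleep, return to user mode and run the handler. When the handler returns, the process appears to return from the system call, but with an error code indicating that the call was interrupted. Note that on BSD systems the kernel can restart the system call automatically.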
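
Both points are historical differences in what signal() does, and the POSIX sigaction() call lets a program state the behaviour it wants explicitly. The sketch below is one way to ask for a handler that stays installed and for interrupted system calls to be restarted. The helper and handler names are made up for the example, while sigaction, sigemptyset and SA_RESTART are the standard POSIX interfaces.

```c
#define _POSIX_C_SOURCE 200809L /* for sigaction() under strict -std */
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t caught_count = 0;

/* Example handler: only touches a sig_atomic_t counter. */
void count_signal(int signo)
{
    (void)signo;
    caught_count++;
}

/* Install a handler that is not reset after delivery (SA_RESETHAND is
   not set) and that restarts interrupted system calls (SA_RESTART).
   The signal itself is blocked while its handler runs, because
   SA_NODEFER is not set. Returns 0 on success, -1 on error. */
int install_persistent_handler(int signo, void (*handler)(int))
{
    struct sigaction sa;
    assert(handler != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}
```

After install_persistent_handler(SIGUSR1, count_signal), two calls to raise(SIGUSR1) run the handler twice without any re-registration in between, which is the BSD-style behaviour described above.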

第三个要注意的地方：若进程睡眠在可中断的优先级上，则当它收到一个要忽略的信号时，该进程被唤醒，但不做longjmp，一般是继续睡眠。但用户感觉不到进程曾经被唤醒，而是象没有发生过该信号一样。

第四个要注意的地方：内核对子进程终止（SIGCLD）信号的处理方法与其他信号有所区别。当进程检查出收到了一个子进程终止的信号时，缺省情况下，该进程就象没有收到该信号似的，如果父进程执行了系统调用wait，进程将从系统调用wait中醒来并返回wait调用，执行一系列wait调用的后续操作（找出僵死的子进程，释放子进程的进程表项），然后从wait中返回。SIGCLD信号的作用是唤醒一个睡眠在可被中断优先级上的进程。如果该进程捕捉了这个信号，就象普通信号处理一样转到处理例程。如果进程忽略该信号，那么系统调用wait的动作就有所不同，因为SIGCLD的作用仅仅是唤醒一个睡眠在可被中断优先级上的进程，那么执行wait调用的父进程被唤醒继续执行wait调用的后续操作，然后等待其他的子进程。

如果一个进程调用signal系统调用，并设置了SIGCLD的处理方法，并且该进程有子进程处于僵死状态，则内核将向该进程发一个SIGCLD信号。

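2. The role of setjmp and longjmp

The description of signal handling above mentioned setjmp and longjmp several times without explaining what they do or how they work. Here is a brief introduction.

In several places above, a process that detects a signal must return directly from the system call it was in, rather than wait for the call to complete. This abrupt change of context is the result of setjmp and longjmp. setjmp saves the current context in the user area, and execution continues in the old context. In other words, when a process executing a system call has to sleep for a resource or some other reason, the kernel does a setjmp on its behalf. If the process is woken by a signal and cannot go back to sleep, the kernel calls longjmp for it: the context that setjmp saved in the process's user area is restored as the current context. The process thus returns to the state it was in before it waited for the resource, and the kernel makes setjmp return 1 so that the process knows the system call failed.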
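
The kernel's internal setjmp and longjmp cannot be called from a program, but the C library functions of the same names perform the same kind of control transfer and make the mechanism easy to see. This is a user-space analogy only, with invented function names, not the kernel code.

```c
#include <setjmp.h>

static jmp_buf saved_context;       /* plays the role of the saved context */
static int first_pass_done;

/* Stands in for "woken by a signal, cannot go back to sleep". */
static void abandon_sleep(void)
{
    longjmp(saved_context, 1);      /* makes the earlier setjmp return 1 */
}

/* Hypothetical demo: returns 1 if control came back through longjmp
 * after the first pass had run, -1 otherwise. */
int jump_demo(void)
{
    first_pass_done = 0;
    switch (setjmp(saved_context)) {
    case 0:                         /* direct return: context just saved */
        first_pass_done = 1;
        abandon_sleep();            /* does not return here */
        return -1;
    case 1:                         /* second return, caused by longjmp */
        return first_pass_done ? 1 : -1;
    default:
        return -1;
    }
}
```

The nonzero return value plays the role of the "return 1" described above: it tells the code after setjmp that it got there by an abandoned operation rather than by normal flow.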

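III. Signal-Related System Calls

The two previous sections covered most of what there is to know about signals. This section looks at the related system calls. The signal system call sets how a process handles a given signal, and the kill system call sends a signal to a specified process. Together these two calls make up the basic signal operations. The other two calls, pause and alarm, implement process suspension and a timer on top of signals; alarm uses a signal to notify the process that the timer has expired. They are therefore covered here as well.

1. The signal system call

The signal system call sets how a given signal is handled. It is declared as follows:
void (*signal(int signum, void (*handler)(int)))(int);
A process that uses this call must include the following header:
#include <signal.h>

The declaration above is fairly complex. If it is hard to read, the same interface can be written with a type definition (sighandler_t is a glibc extension rather than part of POSIX):
typedef void (*sighandler_t)(int);
sighandler_t signal(int signum, sighandler_t handler);
This form relies on a type definition that differs between systems, so consult the manual pages before using it.

In the call, signum identifies the signal whose handling is being set. The second argument, handler, is either a handler function or one of:
SIG_IGN: ignore the signal specified by signum.
SIG_DFL: restore the default handling of the signal specified by signum.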
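
A minimal sketch of SIG_IGN in use, with the hypothetical helper name survives_ignored_signal: it ignores SIGUSR1, sends that signal to the current process, and then restores the previous disposition using the value that signal returned.

```c
#include <signal.h>

/* Hypothetical helper: returns 1 if the process survived a SIGUSR1 that was
 * being ignored, 0 if the disposition could not be changed. */
int survives_ignored_signal(void)
{
    /* signal returns the previous disposition, which is kept for later. */
    void (*previous)(int) = signal(SIGUSR1, SIG_IGN);

    if (previous == SIG_ERR)
        return 0;

    raise(SIGUSR1);              /* discarded; the default action would
                                    have terminated the process */

    signal(SIGUSR1, previous);   /* put the earlier disposition back */
    return 1;
}
```

Passing SIG_DFL in place of previous in the last call would restore the default action regardless of what was set before.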

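The integer argument passed to the handler is the signal number, which lets one handler serve several signals. signal returns the previous handler for signum, or SIG_ERR on error. Here is a simple example: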

#include <signal.h>
#include <unistd.h>
#include <stdio.h>

/* Signal handler; dunno receives the signal number.
 * printf is not async-signal-safe, so this is suitable for a demo only. */
void sigroutine(int dunno) {
    switch (dunno) {
    case SIGHUP:
        printf("Get a signal -- SIGHUP\n");
        break;
    case SIGINT:
        printf("Get a signal -- SIGINT\n");
        break;
    case SIGQUIT:
        printf("Get a signal -- SIGQUIT\n");
        break;
    }
    return;
}

int main() {
    printf("process id is %d\n", (int)getpid());
    /* Install the same handler for three signals. */
    signal(SIGHUP, sigroutine);
    signal(SIGINT, sigroutine);
    signal(SIGQUIT, sigroutine);
    for (;;) ;
}

SIGINT is generated by pressing Ctrl-C, and SIGQUIT by pressing Ctrl-\. Running the program produces the following:

localhost:~$ ./sig_test
process id is 463
Get a signal -- SIGINT //result of pressing Ctrl-C
Get a signal -- SIGQUIT //result of pressing Ctrl-\
//press Ctrl-Z to suspend the process; bg then resumes it in the background
[1]+ Stopped ./sig_test
localhost:~$ bg
[1]+ ./sig_test &
localhost:~$ kill -HUP 463 //send SIGHUP to the process
localhost:~$ Get a signal -- SIGHUP
kill -9 463 //send SIGKILL to terminate the process
localhost:~$
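
The example above calls printf inside the handler, which is convenient for a demonstration, but printf is not among the functions POSIX lists as async-signal-safe. A common alternative is for the handler to do nothing except record the signal number and leave the reporting to ordinary code. The sketch below shows that pattern with one handler shared by several signals; the names record_signal and deliver_and_record are invented for this illustration.

```c
#include <signal.h>

/* Written by the handler, read by normal code. */
static volatile sig_atomic_t last_signal = 0;

/* Handler shared by several signals: it only records which one arrived. */
static void record_signal(int signo) { last_signal = signo; }

/* Hypothetical helper: installs record_signal for signo, sends signo to the
 * current process, restores the previous handler, and returns the signal
 * number the handler saw (-1 if the handler could not be installed). */
int deliver_and_record(int signo)
{
    void (*previous)(int) = signal(signo, record_signal);

    if (previous == SIG_ERR)
        return -1;

    last_signal = 0;
    raise(signo);                /* the handler runs before raise returns */
    signal(signo, previous);
    return (int)last_signal;
}
```

A main loop can then test last_signal at a convenient point and print the message itself.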

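2. The kill system call

The kill system call sends a signal to a process. It is declared as follows:
int kill(pid_t pid, int sig);
A process that uses this call must include the following headers:
#include <sys/types.h>
#include <signal.h>

Subject to permissions, this call can send any signal to any process or process group. If pid is positive, sig is sent to the process whose ID is pid. If pid is 0, sig is sent to every process in the caller's process group. If pid is -1, sig is sent to every process the caller is allowed to signal, except process 1 and the caller itself. If pid is less than -1, sig is sent to every process in process group -pid. If sig is 0, no signal is sent, but the existence and permission checks are still performed. On success the call returns 0; on error it returns -1 and sets errno. Possible error codes include:
EINVAL: the specified signal sig is invalid.
ESRCH: the process or process group specified by pid does not exist. Note that a process that still has a process-table entry may be a zombie, that is, one that has terminated but has not yet been collected by wait.
EPERM: the process does not have permission to send the signal to the target process. To send a signal to process pid, the caller must either have root privileges, or have a real or effective user ID equal to the real user ID or saved set-user-ID of the target. If pid is less than -1, so that the signal goes to a group, this error means the caller lacked permission for the group's members; the call succeeds as long as at least one member could be signalled.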
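
The error codes above can be observed directly. The sketch below assumes a POSIX system; the helper names kill_errno, process_exists and can_signal_self are made up for this example. It uses sig equal to 0 to probe for a process without sending anything.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if kill(pid, sig) succeeds, else errno. */
int kill_errno(pid_t pid, int sig)
{
    return kill(pid, sig) == 0 ? 0 : errno;
}

/* Hypothetical helper: 1 if a process with this ID exists, 0 otherwise.
 * With sig == 0 only the existence and permission checks are performed. */
int process_exists(pid_t pid)
{
    int err = kill_errno(pid, 0);
    return err == 0 || err == EPERM;   /* EPERM: exists, but not ours to signal */
}

/* A process can always probe itself. */
int can_signal_self(void)
{
    return kill_errno(getpid(), 0) == 0;
}
```

Passing a number that is not a valid signal, for example kill_errno(getpid(), 100000), is expected to yield EINVAL.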

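3. The pause system call

The pause system call waits for a signal. It is declared as follows:
int pause(void);
A process that uses this call must include the following header:
#include <unistd.h>

The call puts the calling process to sleep until a signal is received. It always returns -1 and sets errno to EINTR (a signal was received). Here is a simple example: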

#include <unistd.h>
#include <stdio.h>
#include <signal.h>

void sigroutine(int unused) {
    (void)unused;
    printf("Catch a signal SIGINT\n");
}

int main() {
    signal(SIGINT, sigroutine);
    pause();    /* sleep until a signal is caught */
    printf("receive a signal\n");
    return 0;
}

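When this program starts it appears to hang as if in an infinite loop, because the process is waiting for a signal. Pressing Ctrl-C delivers the signal, the handler catches it, and pause returns from its wait.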

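4. The alarm and setitimer system calls

The alarm system call sets a timer; when the timer expires, a signal is sent to the process. It is declared as follows:
unsigned int alarm(unsigned int seconds);
A process that uses this call must include the following header:
#include <unistd.h>

alarm arranges for the kernel to send SIGALRM to the calling process after the given number of seconds. If seconds is 0, any pending alarm is cancelled and no SIGALRM is sent. A later call replaces the setting made by an earlier one. The return value is the number of seconds that remained on the previously set alarm, or 0 if there was none.

Note that alarm delivers the signal only once; to receive it repeatedly, alarm must be called again each time.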
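
The replace-and-report behaviour of alarm can be seen without ever receiving SIGALRM. In this sketch the helper name remaining_after_cancel is an assumption; it sets an alarm and cancels it at once.

```c
#include <unistd.h>

/* Hypothetical helper: schedules an alarm `seconds` from now and cancels it
 * immediately. Returns the seconds that were left on the cancelled alarm. */
unsigned int remaining_after_cancel(unsigned int seconds)
{
    alarm(seconds);      /* replaces any alarm that was already pending */
    return alarm(0);     /* cancels it and reports the time that was left */
}
```

Called with 10, the function should return a value of about 10, and no signal is delivered because the alarm never expires.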

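alarm is not illustrated further here. Many programs on current systems use setitimer instead of alarm to set a timer, and getitimer to read a timer's state. The two calls are declared as follows:
int getitimer(int which, struct itimerval *value);
int setitimer(int which, const struct itimerval *value, struct itimerval *ovalue);
A process that uses these calls must include the following header:
#include <sys/time.h>

These calls give each process three timers, each counting in its own time domain. When any one of them expires, the corresponding signal is sent to the process and the timer is reloaded from its interval. The timer is selected by the which argument:
ITIMER_REAL: counts real (wall-clock) time. On expiry, SIGALRM is sent to the process.
ITIMER_VIRTUAL: counts only while the process itself is executing. On expiry, SIGVTALRM is sent to the process.
ITIMER_PROF: counts both while the process is executing and while the system is executing on its behalf. It forms a pair with ITIMER_VIRTUAL, and the two are often used together to measure the time a process spends in user mode and in kernel mode. On expiry, SIGPROF is sent to the process.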

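The value argument specifies the timer's times. Its structure is:
struct itimerval {
struct timeval it_interval; /* value reloaded after each expiry */
struct timeval it_value; /* time until the next expiry */
};

The timeval structure used here is defined as:
struct timeval {
long tv_sec; /* seconds */
long tv_usec; /* microseconds; 1 second = 1000000 microseconds */
};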

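In a setitimer call, if ovalue is not NULL it receives the value that was set by the previous call. The timer counts it_value down to 0, generates a signal, sets it_value to it_interval, and starts counting again, and so on. The timer stops when it_value is set to 0, or when it expires while it_interval is 0. On success the call returns 0; on error it returns -1 and sets errno:
EFAULT: value or ovalue is not a valid pointer.
EINVAL: which is not one of ITIMER_REAL, ITIMER_VIRTUAL, or ITIMER_PROF.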
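
The sketch below exercises the pieces just described: a one-shot timer (zero it_interval), getitimer to read the time left, a zero it_value to stop the timer, and the EINVAL error for an unknown which. The function names arm_query_disarm and bad_which_errno are invented for this illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: arms a one-shot real-time timer for `seconds`, reads
 * it back, disarms it, and returns the whole seconds that were left
 * (-1 on error). The timer never gets the chance to fire. */
long arm_query_disarm(long seconds)
{
    struct itimerval one_shot, current, off;
    int rc;

    one_shot.it_value.tv_sec = seconds;
    one_shot.it_value.tv_usec = 0;
    one_shot.it_interval.tv_sec = 0;     /* zero interval: no reload */
    one_shot.it_interval.tv_usec = 0;

    off = one_shot;
    off.it_value.tv_sec = 0;             /* zero it_value stops the timer */

    if (setitimer(ITIMER_REAL, &one_shot, NULL) == -1)
        return -1;
    rc = getitimer(ITIMER_REAL, &current);
    setitimer(ITIMER_REAL, &off, NULL);
    return rc == -1 ? -1 : (long)current.it_value.tv_sec;
}

/* Hypothetical helper: returns the errno produced by an invalid `which`. */
int bad_which_errno(void)
{
    struct itimerval current;
    return getitimer(-1, &current) == -1 ? errno : 0;
}
```

Because a little time passes between arming and reading, arm_query_disarm(5) normally reports 4 whole seconds with the remainder in tv_usec.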

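Below is a simple demonstration of setitimer. It delivers SIGALRM once per second of real time, and SIGVTALRM after every 0.5 seconds of CPU time the process spends executing: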

#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>

/* printf is not async-signal-safe; it is used here for demonstration only. */
void sigroutine(int signo) {
    switch (signo) {
    case SIGALRM:
        printf("Catch a signal -- SIGALRM\n");
        break;
    case SIGVTALRM:
        printf("Catch a signal -- SIGVTALRM\n");
        break;
    }
    return;
}

int main() {
    struct itimerval value, ovalue, value2;

    printf("process id is %d\n", (int)getpid());
    signal(SIGALRM, sigroutine);
    signal(SIGVTALRM, sigroutine);

    /* Real-time timer: first expiry after 1 s, then every 1 s. */
    value.it_value.tv_sec = 1;
    value.it_value.tv_usec = 0;
    value.it_interval.tv_sec = 1;
    value.it_interval.tv_usec = 0;
    setitimer(ITIMER_REAL, &value, &ovalue);

    /* Virtual timer: first expiry after 0.5 s of CPU time, then every 0.5 s. */
    value2.it_value.tv_sec = 0;
    value2.it_value.tv_usec = 500000;
    value2.it_interval.tv_sec = 0;
    value2.it_interval.tv_usec = 500000;
    setitimer(ITIMER_VIRTUAL, &value2, &ovalue);

    for (;;) ;    /* busy loop, so the virtual timer keeps advancing */
}

The output of this example looks like this:

localhost:~$ ./timer_test
process id is 579
Catch a signal -- SIGVTALRM
Catch a signal -- SIGALRM
Catch a signal -- SIGVTALRM
Catch a signal -- SIGVTALRM
Catch a signal -- SIGALRM
Catch a signal -- SIGVTALRM

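This article gave a brief introduction to signals on Linux. For the other signal-related calls, consult the manual pages or other documentation.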

Posted in Kernel

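Nginx Module Development: C++ Modules

Posted on September 7, 2013 by Matrix

Nginx is written in pure C and is built with GCC by default. How can an Nginx module be developed in C++, or a C++ library be used from an Nginx module?

First, some background on mixing C and C++.

C++ grew out of C, and to some extent it can be regarded as an extension of C. In essence the two share the same data types and function calling conventions, so compiling C and C++ together is natural.

The difference that matters for linking is the name a function has after compilation: C simply uses the function name, regardless of the number or types of its parameters, whereas C++ always encodes the parameter type list into the compiled name (name mangling). C++ nevertheless provides a special mechanism for declaring C functions, which means a C++ program can declare and call C functions directly.

Calling C functions from C++

When C++ calls a function func_implement_by_c that is implemented in C, its declaration in the C++ code must carry the extern "C" prefix: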

/*
 * file name: cpp_caller.cpp
 */
extern "C" void func_implement_by_c(const char *str);
void cpp_caller()
{
    func_implement_by_c("I'm from CPP");
}

int main()
{
    cpp_caller();
}

Calling C++ functions from C

For C to call C++ code, and member functions in particular, the C++ function usually has to be wrapped in a C-style function:

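/**
 * file name: c_wrapper_of_member_func.cpp
 */
#include <cstdio>

// Stand-in class for illustration only; the original post does not show TestClass.
class TestClass {
public:
    void member_func(const char *str) { std::printf("%s\n", str); }
};

// extern "C" gives the wrapper an unmangled name that C code can link against.
extern "C" void c_wrapper_of_member_func(const char *str)
{
    TestClass a;

    a.member_func(str);
}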

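On the C side, declare the wrapper with an ordinary prototype. A C compiler does not accept extern "C"; the linkage specification belongs on the C++ side, either on the wrapper's definition or in a shared header guarded by #ifdef __cplusplus:

/**
 * file name: c_caller.c
 */
void c_wrapper_of_member_func(const char *str);
void c_caller()
{
   c_wrapper_of_member_func("I'm from C");
}

int main()
{
   c_caller();
}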
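
Because extern "C" is C++ syntax that a C compiler rejects, a declaration meant for both languages is commonly placed in a header that hides the linkage specification from C. The sketch below shows that pattern; the file name c_wrapper.h and the include-guard macro are assumptions, and the parameter is written as const char * so that string literals can be passed from either language.

```c
/* c_wrapper.h: hypothetical shared header, included by both C and C++ files. */
#ifndef C_WRAPPER_H
#define C_WRAPPER_H

#ifdef __cplusplus
extern "C" {            /* seen only by a C++ compiler */
#endif

/* Implemented in a C++ source file, callable from C. */
void c_wrapper_of_member_func(const char *str);

#ifdef __cplusplus
}
#endif

#endif /* C_WRAPPER_H */
```

The C caller and the C++ file that defines the wrapper would both include this header, so the two sides cannot drift apart.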

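Let us now work through the design of a solution for Nginx modules written in C++.

First, do not try to compile the official Nginx code with a C++ compiler (such as g++); doing so introduces a large number of errors that are hard to control. The right approach is to keep compiling the modules shipped with Nginx with a C compiler, compile the modules written in C++ with a C++ compiler, and package them as a static library. Then, following the C and C++ mixed-programming approach, wrap the C++ module in a C-style interface through which it interacts with the other Nginx modules.

1. C++ modules that use the Nginx headers must wrap them in extern "C"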

#ifdef __cplusplus
extern "C" {
#endif

#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>
#include <nginx.h>

#ifdef __cplusplus
}
#endif

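2. Wrap the C++ module in a C-style interface and register the module with Nginx

3. Modify the C++ module's configuration file (the config file in the module directory) to produce the final build

Add the header search path to CORE_INCS, add the library path to CORE_LIBS, and add the -lstdc++ link option:

For example, if the Nginx module code written in C++ is packaged as the static library libcpp_module.a, the config file is modified as follows: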

ngx_feature="http_push_stream_module"
ngx_feature_name=
ngx_feature_run=no
ngx_feature_incs=
ngx_feature_path=
ngx_feature_libs=
ngx_feature_test=

CPP_MODULE_DIR="/data/packages/nginx-push-stream-module/cpp-module"

ngx_addon_name=ngx_http_push_stream_module
HTTP_MODULES="$HTTP_MODULES ngx_http_push_stream_module"
CORE_INCS="$CORE_INCS \
    $ngx_addon_dir/src \
    $ngx_addon_dir/include \
    $CPP_MODULE_DIR"
CORE_LIBS="$CORE_LIBS -L$CPP_MODULE_DIR -lcpp_module -lstdc++"
NGX_ADDON_SRCS="$NGX_ADDON_SRCS \
    ${ngx_addon_dir}/src/ngx_http_push_stream_module.c"

have=NGX_HTTP_HEADERS . auto/have
. auto/feature

# If SHA1 is not available, or WebSocket support is not wanted, comment out the lines below
USE_SHA1=YES
have=NGX_HAVE_SHA1 . auto/have

Posted in Nginx
