ZooKeeper usage scenarios

April 7, 2017 Comments are closed

Principles, part 1: ZooKeeper fundamentals

ZooKeeper service network structure
A ZooKeeper ensemble can be split into two roles: a single Leader and the remaining Followers. The Leader is determined by an internal election.

The Leader and the Followers communicate with one another. ZooKeeper keeps all of its data in memory and also persists a copy to disk. Every ZooKeeper node can be viewed as exposing the same namespace, that is, the same data (the tree structure described below).

  • If the Leader fails, the cluster re-elects; a new Leader is chosen within milliseconds.
  • The ZooKeeper service becomes unavailable only if more than half of the nodes in the cluster are down.

ZooKeeper namespace structure

ZooKeeper's namespace is the file system of a ZooKeeper application. It is very much like the Linux file system: it is a tree, so every path is unique, and all namespace operations use absolute paths. Unlike Linux, which distinguishes directories from files, ZooKeeper has a single kind of node, the znode; a znode can contain child znodes and can also hold data.

Take /Nginx/conf as an example: / is a znode and /Nginx is a child znode of /. /Nginx can also hold data, say the IPs of every machine with Nginx installed. /Nginx/conf is a child znode of /Nginx and can hold data too, say the contents of the Nginx configuration file. In an application, this single path gives us both the list of machines running Nginx and the Nginx configuration on those machines, as the sketch below shows.
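As an illustration of the namespace above, here is a minimal sketch using the kazoo Python client (kazoo, the server address, and the sample values are assumptions, not part of the original post):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# /Nginx holds the IPs of machines running Nginx, /Nginx/conf the config file.
zk.create("/Nginx", b"10.0.0.1,10.0.0.2", makepath=True)
zk.create("/Nginx/conf", b"worker_processes 4;")

ips, _ = zk.get("/Nginx")        # data stored on the znode itself
conf, _ = zk.get("/Nginx/conf")  # a child znode with its own data
print(ips, conf)
```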

Reading and writing data in ZooKeeper

  • Writing: when a client issues a write, it targets one node in the ZooKeeper cluster. If a Follower receives the write it forwards the request to the Leader, and the Leader atomically broadcasts it via the internal Zab protocol. Once every node has applied the write (updating memory and disk), the request is complete and the ZooKeeper service sends the response back to the client.

 

  • Reading: because every node in the cluster presents the same namespace view (the same structure and data), and the write path above keeps all nodes in sync, a read can be served by any node.

 

  • PS: in reality a write does not wait for every node before responding; once more than half of the nodes have applied the change it is committed to memory and treated as the latest namespace. A read may therefore hit a node that is not yet up to date, in which case the only remedy is sync() (see the sketch below). For now, assume the whole ZooKeeper service keeps its metadata in sync; a later article will come back to this.
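A minimal sketch of the sync() point above, again assuming the kazoo client and the /Nginx/conf znode from the previous example:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.set("/Nginx/conf", b"worker_processes 8;")  # the write goes through the Leader

# The node we read from may still lag behind the quorum, so force it to
# catch up with the Leader before reading.
zk.sync("/Nginx/conf")
data, stat = zk.get("/Nginx/conf")
print(data, stat.version)
```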

Znode types
When a znode is created its type can be specified; the main types are listed below, with a short creation sketch after the list.

  • PERSISTENT: a persistent znode. Once created, the data it stores does not disappear on its own; only an explicit delete by a client removes it.

 

  • SEQUENCE: a sequentially numbered znode. If client A asks ZooKeeper to create a znode named /Nginx/conf with this type, ZooKeeper actually creates /Nginx/conf0000000000; client B creating the same name gets /Nginx/conf0000000001, client C gets /Nginx/conf0000000002, and any later client gets a znode numbered one higher than the current maximum in the namespace. In other words, every client that creates this znode is guaranteed a number that is increasing and unique.

 

  • EPHEMERAL: a temporary znode. A client establishes a session when it connects to the ZooKeeper service, and znodes of this type created over that connection live only as long as the session: once the client closes the connection the server clears the session, and every znode created within it disappears from the namespace. In short, the lifetime of such a znode equals the lifetime of the client's connection. For example, if client A creates an EPHEMERAL znode /Nginx/conf0000000011, that znode is removed from the entire ZooKeeper namespace as soon as A's connection closes.

 

  • PERSISTENT|SEQUENTIAL: a persistent, automatically numbered znode. Its sequence number is the highest existing number plus one, and it does not disappear when the session ends.

 

  • EPHEMERAL|SEQUENTIAL: a temporary, automatically numbered znode. Its sequence number also increases automatically, but it disappears together with the session.
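The creation sketch promised above, using kazoo flags that map onto these types (kazoo and the sample paths are assumptions):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.create("/Nginx/logs", b"", makepath=True)                    # PERSISTENT
zk.create("/Nginx/conf", b"", sequence=True)                    # PERSISTENT|SEQUENTIAL: server appends a 10-digit counter
zk.create("/Nginx/worker", b"10.0.0.1", ephemeral=True)         # EPHEMERAL: gone when the session closes
zk.create("/Nginx/worker", b"", ephemeral=True, sequence=True)  # EPHEMERAL|SEQUENTIAL
```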

Principles, part 2: Watchers and notifications in ZooKeeper

Two concepts should be clear before reading on:

1. What is a znode?


2. What is a client?

We use the term znode to refer to a ZooKeeper data node.

A znode maintains a stat structure that holds version numbers for data changes and ACL changes, plus timestamps, which allows cache validation and coordinated updates. Whenever a znode's data changes, its version number increases. When a client receives data, it also receives the data's version number.

The data stored in each znode is read and written atomically: a read returns all of the znode's data, and a write replaces all of it. Each node has an access control list (ACL) that restricts who may perform which operations.

 

ZooKeeper's roles fall into three main categories, listed in a table in the original post.
The system model is shown in a figure in the original post.

 

Polling a remote service the traditional way
A traditional remote service works like this: providers register their services with a remote registry, and consumers keep polling the registry to see whether a provider has published or updated a service. The drawbacks are high latency, plus the load and network traffic caused by many needless empty polls. Transplanted onto ZooKeeper, the model would look like this (figure in the original post).

 

 

Asynchronous callbacks in ZooKeeper

ZooKeeper's implementation replaces polling with asynchronous callbacks by introducing event notifications: a client registers with the ZooKeeper service to receive change events for a znode, so as soon as a watched znode changes, the client is notified and runs its own business logic.

 

  • A ZooKeeper client can register a watcher through exists, getChildren, or getData; a watcher is simply a callback the client specifies.
  • When is the watcher invoked? Once the watched znode changes, the ZooKeeper service sends the znode's path to the client, and the client invokes the callback it registered earlier. create, delete, and setData on a node all trigger the watcher, that is, the callback() function.

The server stores only the event information; the client stores the event information plus the Watcher's execution logic. When registering a watcher, the client keeps a local mapping from znode path to callback(); the server keeps the session, the znode path, and the event type. When the server triggers an event, it sends the event type and the znode path to the client, which looks up the mapping and calls the corresponding callback(); all of the subsequent business logic runs on the client (see the sketch below).
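A small sketch of registering watchers with kazoo (an assumption; its exists(), get(), and get_children() mirror exists, getData, and getChildren above, and /Nginx/conf is assumed to exist):

```python
from kazoo.client import KazooClient

def on_event(event):
    # The server sends only the event type and the znode path; the client
    # looks up this callback locally and runs its business logic here.
    print("event", event.type, "on", event.path)

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.exists("/Nginx/conf", watch=on_event)   # fires on create/delete/setData
zk.get("/Nginx/conf", watch=on_event)      # fires on data change or delete
zk.get_children("/Nginx", watch=on_event)  # fires when the children change
```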

Watchers in ZooKeeper fire only once

A watch in ZooKeeper is one-shot: one change triggers one notification, and to be notified again the watch must be re-registered. Why not register watchers permanently? Presumably mainly because of the performance impact. Consider the following situation:

  • Client C1 sets a watcher on the znode /Task.
  • Client C2 comes along and adds a child znode under /Task.
  • Client C1 receives the notification and learns that the watched znode has changed.
  • While C1 is still handling that notification, Client C3 adds another child znode under /Task.

At step 3, C1's watcher has already fired; at step 4 there is no watcher any more unless it is re-registered, so one notification is lost in this process. This touches on the CAP principle: ZooKeeper guarantees only eventual consistency, not strong consistency, but because it guarantees sequential consistency, eventual consistency is assured.

  • Strong consistency: after a data change in a distributed system, any server you access returns the latest data.
  • Weak consistency: after a data change, some servers return the latest data and some do not.
  • Eventual consistency: right after an update you may not get the latest data, but eventually the latest data is served.
  • Sequential consistency: for N updates, the service is guaranteed to apply and serve them in that order.

The case above does have a remedy: call getData after every watcher registration so that you always hold the latest version of the data. Even so, the advantage over traditional polling is clear. A small sketch of this pattern follows.
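A sketch of that pattern with kazoo (an assumption): re-register inside the callback and re-read the data, so each change re-arms the one-shot watch. kazoo's DataWatch recipe automates exactly this re-registration.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def watch_conf(event=None):
    # get() both fetches the latest data and arms a fresh one-shot watch.
    data, stat = zk.get("/Nginx/conf", watch=watch_conf)
    print("version", stat.version, "->", data)

watch_conf()  # initial read plus the first registration
```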

Versions in ZooKeeper

Every change to a znode's data produces an increasing version number. ZooKeeper API calls can carry a version number, and the operation is allowed only if the client-supplied version matches the server's. Versions matter whenever several clients operate on the same znode. Consider the following situation:

  • Client C1 writes some configuration data to the znode /Nginx/conf, and its version goes from version1 to version2.
  • Meanwhile Client C2 also wants to write /Nginx/conf, but C2 still holds version1 while the server is already at version2, so the write conflicts and fails. C2 must first catch up to version2 and then resubmit; the server then moves from version2 to version3 and C2's local copy updates to version3 (see the sketch below).
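The sketch referred to above: a conditional update with an explicit version using kazoo (an assumption), so a stale client fails instead of silently overwriting the other write:

```python
from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

data, stat = zk.get("/Nginx/conf")   # C2 reads and remembers the version
try:
    zk.set("/Nginx/conf", b"new conf", version=stat.version)
except BadVersionError:
    # C1 got there first; re-read to pick up the newer version, then retry.
    data, stat = zk.get("/Nginx/conf")
    zk.set("/Nginx/conf", b"new conf", version=stat.version)
```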

 


Use case 1: electing a Master
The ZooKeeper use-case discussions raise the problem of managing a Master node: how do we guarantee that the cluster's Master is both available and unique? Below we implement this with ZooKeeper.
Assuming the ZooKeeper ensemble is installed, configured, and already serving requests, Master election proceeds through the following steps:

  • The client connects to ZooKeeper and checks whether the znode /Roles/workers exists; if not, it creates it as a PERSISTENT znode so that it does not disappear when the client's session closes.
  • The client creates a SEQUENCE|EPHEMERAL znode under /Roles/workers with the prefix worker; ZooKeeper guarantees the sequence number is increasing, and, as described earlier, an EPHEMERAL znode disappears once the session closes.
  • The client calls getChildren to list all znodes under /Roles/workers and sets a Watcher to wait for notifications; each znode in the returned list corresponds to one competing client.
  • Sort the list returned in step 3 and find the smallest worker number. If it matches the znode created in step 2, this client's number is the smallest and it is the Master. If not, wait for a notification; once the Watcher fires, go back to step 3.

The mechanism above relies on a few ZooKeeper properties:

  • When N clients concurrently request create on the same znode, ZooKeeper guarantees sequential consistency and gives each client a znode whose sequence number is increasing and unique.
  • Because the created znodes are ephemeral, a znode disappears from ZooKeeper as soon as its session ends; every client that set a Watcher is then notified and a new election is held, and the client with the smallest number is the Master, which guarantees uniqueness.

The original post includes a diagram of this logic at this point.

Below is the implementation. By default it connects to the local ZooKeeper service on port 2181; the zkclient module comes from ZooKeeper's Python bindings. Simply run several copies of the script to carry out the Master election.
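The original zkclient script is not reproduced in this excerpt; the following is a minimal sketch of the same election steps using the kazoo Python client (kazoo is an assumption, not the original code):

```python
import time
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Step 1: make sure the persistent parent exists.
zk.ensure_path("/Roles/workers")

# Step 2: create an EPHEMERAL|SEQUENCE child; the server appends the counter.
my_path = zk.create("/Roles/workers/worker", ephemeral=True, sequence=True)
my_name = my_path.rsplit("/", 1)[1]

def elect(event=None):
    # Step 3: list the children and re-register the watch.
    workers = zk.get_children("/Roles/workers", watch=elect)
    # Step 4: the smallest sequence number wins.
    if my_name == sorted(workers)[0]:
        print(my_name, "is the Master")
    else:
        print(my_name, "is a slave, waiting")

elect()
while True:
    time.sleep(1)
```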

  • Run the script in three terminals one after another to simulate clients c1, c2, and c3; the znodes created are /Roles/workers/worker0000000000, /Roles/workers/worker0000000001, and /Roles/workers/worker0000000002 in turn.
  • c1 wins the election and becomes Master, while c2 and c3 are slaves.
  • Close c1, so that /Roles/workers/worker0000000000 disappears; after a while c2 and c3 re-elect, c2 becomes Master, and c3 stays a slave.
  • Restart c1 and it immediately rejoins the cluster; the notifications show that a new znode /Roles/workers/worker0000000003 was created, a re-election takes place, and c2 is still Master.

PS: in the third bullet above, the znode is deleted only some time after the client is closed, because the ZooKeeper session has not yet expired (the client was killed with Ctrl+C). When a client is added, however, its znode is created immediately and the other clients that registered watchers are notified right away.


Use case 2: implementing a distributed lock

The ZooKeeper use-case discussions describe the problem of keeping configuration files in sync across a distributed cluster. Imagine 100 machines simultaneously modifying the same file on one machine: how do we keep the text from being scrambled? That is the simplest distributed lock, and this section implements one with ZooKeeper. The steps for a write lock are below.

Distributed write lock
First create a PERSISTENT znode /Locks/write_lock.

  • The client creates a SEQUENCE|EPHEMERAL znode under it named with the prefix lockid, e.g. /Locks/write_lock/lockid0000000001.
  • Call getChildren() on /Locks/write_lock without setting a Watcher and get the list of child znodes.
  • If the znode created in step 1 is the smallest in the list, the client holds the lock; if not, continue.
  • Call exists() on the znode whose sequence number is immediately below your own (the largest znode smaller than yours), setting a Watcher; if exists() returns false, go back to step 2.
  • If exists() returns true, wait for the ZooKeeper notification and, in the callback, go back to step 2.

Releasing the lock is simply deleting the znode, or just closing the connection.

*Note: getChildren() in step 2 is called without a Watcher to avoid the herd effect. If getChildren() set a Watcher, every client would be notified whenever the cluster so much as twitched. While competing for the lock this would run over and over, and in the vast majority of runs a client would only learn that it is not the smallest and go back to waiting for the next notification, which is clearly not sensible. Clients would receive a flood of notifications that have nothing to do with them, which in a large cluster puts a heavy load on the server; and if many clients disconnect at the same moment, the server has to push a large number of notifications to the remaining clients. This is the so-called herd effect. A sketch of the recipe follows.
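A sketch of this write-lock recipe using kazoo primitives (kazoo is an assumption; note that kazoo also ships a ready-made Lock recipe for production use):

```python
import threading
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/Locks/write_lock")

# Step 1: ephemeral + sequential child.
me = zk.create("/Locks/write_lock/lockid",
               ephemeral=True, sequence=True).rsplit("/", 1)[1]

def acquire():
    while True:
        # Step 2: list the children without a watch (avoids the herd effect).
        children = sorted(zk.get_children("/Locks/write_lock"))
        # Step 3: the smallest sequence number owns the lock.
        if me == children[0]:
            return
        # Step 4: watch only the znode just below ours.
        predecessor = children[children.index(me) - 1]
        released = threading.Event()
        if zk.exists("/Locks/write_lock/" + predecessor,
                     watch=lambda event: released.set()):
            # Step 5: wait for the notification, then go back to step 2.
            released.wait()

acquire()
print("lock held by", me)
zk.delete("/Locks/write_lock/" + me)  # releasing the lock
```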

Category: Distributed systems theory Tags:

A rough read of some emails

March 14, 2017 Comments are closed

Forgive me for quoting these emails directly.
This is an analysis from when CFS first appeared (v8), written by a fellow Chinese developer who was a graduate student at UMass at the time.

Hi, Ingo

My name is Ting Yang, a graduate student from UMASS. I am currently
studying the linux scheduler and virtual memory manager to solve some
page swapping problems. I am very excited with the new scheduler CFS.
After I read through your code, I think that you might be interested in
reading this paper:

“A Proportional Share REsource Allocation Algorithm for Real-Time,
Time-Shared Systems”, by Ion Stoica. You can find the paper here:
http://citeseer.ist.psu.edu/37752.html

Authors of this paper proposed a scheduler: Earliest Eligible Virtual
Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to
track the execution of each running task. The only difference between
EEVDF and CFS is that EEVDF tries to _deadline_ fair while CFS is
_start-time_ fair. Scheduling based on deadline gives better reponse
time bound and seems to more fair.

In the following part of this email, I will try to explain the
similarities and differences between EEVDF and CFS. Hopefully, this
might provide you with some useful information w.r.t your current work
on CFS.

Similarities:
1. both EEVDF and CFS use virtual time to track the system’s progress.

CFS maintains rq->fair_clock for each cpu, which is updated every
tick by:
SCALE/sum(p_i->loadweight)
where p_i->loadweight is the weight of each task mapped from its nice
value in prio_to_load_shift[], given that the default weight is SCALE (1024)

EEVDF maintains a virtual time, which is advanced every tick by:
1/sum(w_i)
where w_i is the weight of each task, given that the default weight is 1.

Therefore, EEVDF and CFS monitors system progress in the same way,
except normalized to different scale.

2. EEVDF and CFS monitors the progress in the same way.

CFS maintains a p->fair_key which indicating the amount of work done
by this task. When p executes for a period t, then p->fair_key
incremented by:
t * SCALE/p->loadweight //the default weight is SCALE
(based on solving equations in your code :-))

EEVDF does the same thing with assumption that the default weight is
1, it uses the same equation to calculate deadlines for running tasks.

Differences:
The main difference between CFS and EEVDF lies in the scheduling
policy, although they follow the same model for tracking tasks.

*** CFS: When a task starts, it gets p->fair_key from the current
virtual time rq->fair_clock. It will not be scheduled for execution
until all other running tasks’ fair_key go beyond p->fair_key by certain
virtual ticks (which is 5 ticks for the current setting of CFS).

(I wish I could show examples with graphs, instead of my not-so-good
english, but I am not sure if it appropriate to attach figures on this
maillist)

EXAMPLE: assume the system runs at 1000 tick/second, i.e. 1ms a tick,
and the granularity of pre-exemption for CFS is 5 virtual ticks (the
current setting). If, at time t=0, we start 2 tasks: p1 and p2, both
have nice value 0 (weight 1024), and rq->fair_clock is initialized to 0.
Now we have:
p1->fair_key = p2->fair_key = rq->fair_clock = 0.
CFS breaks the tie arbitrarily, say it executes p1. After 1 system tick
(1ms later) t=1, we have:
rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0.
Suppose, a new task p3 starts with nice value -10 at this moment, that
is p3->fair_key=1/2. In this case, CFS will not schedule p3 for
execution until the fair_keys of p1 and p2 go beyond 5+1/2 (which
translates to about 10ms later in this setting), _regardless_ the
priority (weight) of p3. Further imagine, if we starts n tasks at time
t=0 and then start a task p_(n+1) at time t = 1, the delay of task
p_(n+1) actually is proportional to the number of running tasks n.

Formally speaking, CFS can has a *O(n)* lag w.r.t the ideal
proportional shared fluid-flow system, which can be arbitrarily fine
granularity. The cause of this problem is that p->fair_key only reflects
a fair point that a task should be started, but does not has any
information about how urgent the task is (i.e. the priority or weight of
the task).

*** In EEVDF, each task p_i is specified by 2 parameters: weight w_i
and timeslice length l_i. EEVDF tries to schedule tasks based on the
virtual deadline vd_i where a timeslice l_i should be finished.
EEVDF keeps a virtual start (ve_i) and virtual deadline (vd_i) for
each tasks. When a tasks starts, its ve_i is initialized to be the
current virtual time, and calculates its virtual deadline as:
vd_i = ve_i + l_i/w_i //the same method as CFS updates fair_key.
When l_i amount of work is done, the just finished vd_i becomes the new
ve_i. That is the virtual start time of next l_i amount work is the
deadline of previous finished timeslice. The virtual deadline vd_i is
then updated using the above equation.

EEVDF schedule policy: always executes the task with the _earliest_
virtual deadline.

EXAMPLE: Assume the system has 1000 ticks per second. At time t = 0,
we start 2 tasks: p1 and p2, such that w_1 = w_2 = 1 and l_1 = l_2 = 5
ticks, i.e 5ms (which reflects the similar setting in CFS case).
Furthermore, the system virtual time vt is initialized to be 0. Now at
time t = 0, we have
vt = 0,
ve_1 = 0, vd_1 = ve_1 + l_1/w_1 = 5
ve_2 = 0, vd_2 = ve_2 + l_2/w_2 = 5
As p1 and p2 have the same virtual deadline, EEVDF breaks the tie
arbitrarily, say it executes p1. After 1 system tick (1ms later), we have:
vt = 0 + 1 / (w_1 + w_2) = 1/2 //just as CFS updates rq->fair_clock
ve_1 = 0, vd_1 = 5 //not changed
ve_2 = 0, vd_2 = 5 //not changed
Suppose we start a new task p3 at this moment, with weight w_3 = 2 and
timeslice l_3 = 5 ticks (5ms), Then
ve_3 = vt = 1/2
vd_3 = ve_3 + l_3/w_3 = 1/2 + 5/2 = 3
hmmm, the scheduler now will execute task p3 first since it has the
earliest deadline. The deadline implicitly contains some very important
information that the CFS fair_key does not have: how urgent this amount of
work has to be done, which comes from the weight of a task.

Formally speaking, EEVDF has a constant O(1) lag w.r.t the ideal
proportional shared fluid-flow system. (Please look at the paper for
detail proofs.) Under normal cases, for a task p_i, EEVDF ensures that
it does not fall behind the ideal system by more than l_i time.
Occasionally, the delay can be max(l_i), the maximum timeslice used by
all active tasks, due to the dynamic entering and leaving of tasks.
(Again, the paper give more detailed explanation on this).

We can see that here the timeslice l_i used by a task p_i actually
controls accuracy of scheduling p_i. The smaller l_i, the closer to the
ideal system during p_i’s execution. For example, in above case, if p3
has w_3 = 1 (the same as p1 and p2) and l_3 = 3 (3ms). Since vd_3 = 1/2
+ l_3/w_3 = 7/2, the scheduler still executes p3 first, even though
p1,p2 and p3 have the same weight. Smaller l_i leads the scheduler to
handle p_i in finer granularity. Intuitively, it makes scheduler checks
the deadline of p_i more frequently
and ensures each small piece of work is done on time, while larger l_i
does the reverse.

The decouple of weight w_i and timeslice l_i is important. Generally
speaking, weight determines throughput and timeslice determines the
responsiveness of a task. In normal situation, high priority tasks
usually need more cpu capacity within short period of time (bursty, such
as keyboard, mouse move, X updates, daemons, etc), and need to be
processed as quick as possible (responsiveness and interactiveness).
Follow the analysis above, we know that for higher priority tasks we
should give _higher weight_ to ensure its CPU throughput, and at the
same time give _smaller timeslice_ to ensure better responsiveness.
This is a bit counter-intuitive against the current linux
implementation: smaller nice value leads to higher weight and larger
timeslice.

Now let’s see what happens in the Real-Time world. Usually a real-time
application is specified with (C_i, T_i), where C_i is the amount of
work and T_i is the period of time that the work has to be finished.
For example, (20ms, 50ms) says the scheduler has to do 20ms work within
a window of 50ms for this task. Smaller T_i gives tighter response
constraints. Note that Bi = Ci/Ti is really the CPU proportion for the
task during its execution. In our proportional time share world, weight
w_i decides the cpu proportion which maps to Bi, and timeslice l_i gives
the amount works should be done each time, which maps Ci. Then using w_i
and l_i, we can get a window size, which actually maps Ti. Therefore
smaller l_i leads to smaller Ti which means tighter response
constraints. It seems to me all these make a lot sense in both
proportional time share and real-time world.

Based on my understanding, adopting something like EEVDF in CFS should
not be very difficult given their similarities, although I do not have
any idea on how this impacts the load balancing for SMP. Does this worth
a try?

Sorry for such a long email 🙂

Ting
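
(Outside the quoted email, a tiny Python illustration of the virtual-time arithmetic Ting walks through above; the weights and timeslices mirror his example, and this is not code from either CFS or EEVDF.)

```python
def eevdf_deadline(virtual_start, timeslice, weight):
    # vd_i = ve_i + l_i / w_i
    return virtual_start + timeslice / weight

# p1 and p2 start at vt = 0 with weight 1 and a 5-tick timeslice.
vt = 0.0
vd1 = eevdf_deadline(0.0, 5, 1)  # 5.0
vd2 = eevdf_deadline(0.0, 5, 1)  # 5.0

# After p1 runs for 1 tick, virtual time advances by 1 / sum(weights).
vt += 1 / (1 + 1)                # 0.5

# p3 arrives now with weight 2 and a 5-tick timeslice.
vd3 = eevdf_deadline(vt, 5, 2)   # 0.5 + 2.5 = 3.0

# EEVDF runs the earliest virtual deadline first, so p3 goes first.
print(sorted([("p1", vd1), ("p2", vd2), ("p3", vd3)], key=lambda t: t[1]))
```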

Ingo responded appreciatively to this analysis:

Well, this is a difference but note that it’s far from being the ‘only
difference’ between CFS and EEVDF:

– in CFS you have to “earn” your right to be on the CPU, while EEVDF
gives out timeslices (quanta)

– EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they
know the length of work units – while CFS does not need any knowledge
about ‘future work’, it measures ‘past behavior’ and makes its
decisions based on that. So CFS is purely ‘history-based’.

– thus in CFS there’s no concept of “deadline” either (the ‘D’ from
EEVDF).

– EEVDF seems to be calculating timeslices in units of milliseconds,
while CFS follows a very strict ‘precise’ accounting scheme on the
nanoseconds scale.

– the EEVDF paper is also silent on SMP issues.

– it seems EEVDF never existed as a kernel scheduler, it was a
user-space prototype under FreeBSD with simulated workloads. (have
they released that code at all?).

The main common ground seems to be that both CFS and EEVDF share the
view that the central metric is ‘virtual time’ proportional to the load
of the CPU (called the ‘fair clock’ in CFS) – but even for this the
details of the actual mechanism differ: EEVDF uses 1/N while CFS (since
-v8) uses a precise, smoothed and weighted load average that is close to
(and reuses portions of) Peter Williams’s load metric used in smp-nice.

The EEVDF mechanism could perhaps be more appropriate for real-time
systems (the main target of their paper), while the CFS one i believe is
more appropriate for general purpose workloads.

So i’d say there’s more in common between SD and CFS than between EEVDF
and CFS.

So … it would certainly be interesting to try the EEVDF paper based on
CFS (or whatever way you’d like to try it) and turn the EEVDF paper into
a real kernel scheduler – the two mechanisms are quite dissimilar and
they could behave wildly differently on various workloads. Depending on
test results we could use bits of EEVDF’s approaches in CFS too, if it
manages to out-schedule CFS 🙂

(your observation about CFS’s fork handling is correct nevertheless!)

Ingo

A good related link: http://oliveryang.net
Also see the LKML thread: https://lkml.org/lkml/2007/5/1/632

Category: Kernel Tags:

[Repost] Comparing 64-bit and 32-bit registers and assembly

February 25, 2017 Comments are closed

Reposted from http://blog.csdn.net/qq_29343201/article/details/51278798


Differences in the 64-bit (newly added) assembly instructions


The mov, push, and pop instructions gain the quad-word variants movq, pushq, and popq.

Note: movabsq is not an extension of a 32-bit instruction but an entirely new one. It stores a 64-bit immediate directly into a 64-bit register; movq can only take a 32-bit immediate, hence the new instruction.


A small aside: 64-bit assembly code sometimes places a rep before ret. The rep has no real effect there; it exists only because of AMD processors, to avoid a jmp landing directly on a ret, which lets the processor run a little faster.

Differences in procedure (function) calls

  • Arguments are passed in registers (see above).
  • callq pushes an 8-byte return address onto the stack.
  • Many functions no longer have a stack frame; only functions that cannot keep all their local variables in registers allocate space on the stack.
  • A function may use up to 128 bytes of stack beyond the stack pointer, so it can store data on the stack without moving the stack pointer (in other words, the 128 bytes below rsp, called the red zone, are always available in x86-64).
  • There is no frame pointer any more; stack locations are addressed relative to the stack pointer. Most functions allocate all the stack space they need at the start of the call and keep the stack pointer unchanged afterwards.
  • Some registers are designated callee-saved; a function that wants to change their values must save them first and restore them afterwards.

Differences in argument passing

  • Six registers are used to pass arguments (see above).
  • The remaining arguments are passed on the stack as before, but addressed relative to rsp rather than ebp (ebp is no longer a frame pointer): the 7th argument starts at rsp, the 8th at rsp+8, and so on.
  • At the call, rsp moves down by 8 bytes (to store the return address); register arguments are unaffected, and inside the callee the 7th argument is now at rsp+8, the 8th at rsp+16, and so on.

Differences in stack frames

In many cases a stack frame is no longer needed at all: if a function calls no other function and registers can hold all of its values, only the return address has to be stored.
A stack frame is needed when:

  • there are too many local variables for the registers,
  • some local variables are arrays or structs,
  • the function takes the address of a local variable with the address-of operator,
  • the function must pass some arguments to another function on the stack,
  • the function needs to save some callee-saved registers (so they can be restored later).

Modern stack frames, however, usually have a fixed size set at the very beginning of the call; the stack pointer then stays unchanged for the whole call, so values are accessed by adding offsets to it, and EBP is no longer needed as a frame pointer.

Even when we say there is "no stack frame", every function call still pushes a return address. That address can itself be regarded as a minimal "stack frame", since it too preserves the caller's state.

 

Reposted from http://blog.csdn.net/qq_29343201/article/details/51278798

64-bit ASM tutorial references:

http://cs.lmu.edu/~ray/notes/nasmtutorial/

http://asmtutor.com/#lesson1

https://cs.nyu.edu/courses/fall11/CSCI-GA.2130-001/x64-intro.pdf

Category: Uncategorized Tags:

Rambling: on how to track and understand a subsystem

January 31, 2017 Comments are closed

The CFS scheduler was created back in 2007. If you did not catch it right then, that is, you were not yet able to understand it when it first appeared in 2007, and only five, or even seven or eight years later did you begin to want to truly understand the scheduling machinery, that is fine. It just takes more time; I am still on that road myself.

For the scheduler, a truly revolutionary patch is a good entry point: focus on that patch, because many capable people will have discussed it, and those discussions are the raw material to absorb; books can no longer explain the system. Even so, diving head-first into such a patch will not give you a good overall picture, because the system has been evolving for too long. You have to start from the moment it was created and work through all the relevant patches and discussions. That can only be done quietly, though you can also join other people's discussions; the quieter, the better it sinks in. Repeat, and experiment (which I am leaving out here). One day you find yourself inside that territory without noticing. Ha, pure rambling. Happy new year…

Category: Kernel Tags:

Get better, and take care

February 29, 2016 1 comment

I have never written anything down here before. I plan to gradually share my tracking of the kernel community (especially the scheduler). I may not cover many details, and my limited ability to run experiments may keep me from demonstrating things, but I will try to share the material I gather.

Speaking of the scheduler: upper-layer application work such as multithreading relies on a similar way of thinking, namely concurrency, and that is one of the key techniques for truly understanding the kernel. The best material I know of on kernel concurrency is the book Paul E. McKenney assembled in 2015: "Is Parallel Programming Hard, And, If So, What Can You Do About It?". I am studying it as well.

From following and studying the scheduler mailing list over this period, I have found that most contributors are developers at CPU vendors and server companies, from Germany, the Netherlands, the UK, Belgium, Korea, and of course China. The active Chinese scheduler developers I know of include Michael Wang (IBM) and Alex Shi (Intel).

Enough said. One last wish: get better, take care. Everything changes, and everything also stays.

Category: Kernel Tags: