
EPOS: Embedded Parallel Operating System

Antônio Augusto Medeiros Fröhlich
Federal University of Santa Catarina
Computer Science Department
PO Box 476, 88040-900 Florianópolis - SC, Brazil

March 2002

Contents

List of Figures                                 iv
List of Tables                                  vii

1 A Bit of History                              7
2 Fundamentals                                  9
3 System Abstractions                           11
  3.1 Memory Management                         12
    3.1.1 Memory Segments                       13
    3.1.2 Address Spaces                        14
  3.2 Process Management                        16
    3.2.1 Tasks                                 17
    3.2.2 Threads                               18
    3.2.3 Processor Schedulers                  19
  3.3 Process Coordination                      23
    3.3.1 Synchronizers                         24
  3.4 Inter-Process Communication               26
    3.4.1 Communicators                         26
    3.4.2 Channels                              28
    3.4.3 Networks                              31
    3.4.4 Message Envelopes                     32
  3.5 Time Management                           33
  3.6 I/O Management                            34
    3.6.1 Buses                                 35
    3.6.2 Devices                               35
    3.6.3 Interrupt Handlers                    36
  3.7 External Abstractions                     38
  3.8 Summary of EPOS Abstractions              39
4 Scenario Aspects                              42
  4.1 Identification                            42
  4.2 Sharing                                   44
  4.3 Allocation                                46
  4.4 Protection                                47
  4.5 Timing                                    49
  4.6 Atomicity                                 50
  4.7 Remote Invocation                         51
  4.8 Debugging and Profiling                   54
  4.9 Summary of EPOS Scenario Aspects          55
5 System Architectures                          57
  5.1 Component Framework                       57
    5.1.1 Framework Metaprogram                 58
    5.1.2 Composition Rules                     62
  5.2 Portability                               65
    5.2.1 The Setup Utility                     67
    5.2.2 Hardware Mediators                    68
  5.3 Initialization                            70
    5.3.1 The Init Utility                      70
    5.3.2 System Organization                   72
6 Automatic Configuration                       74
7 EPOS Implementation for the SNOW Cluster      79
  7.1 Why a Cluster of Workstations?            79
  7.2 The SNOW Cluster                          81
    7.2.1 Processing Nodes                      81
    7.2.2 Interconnects                         83
  7.3 EPOS Implementation                       85
    7.3.1 The Myrinet Network Abstraction       87
    7.3.2 Utilities                             93
    7.3.3 Tools                                 94
8 Summary                                       98
Bibliography                                    101


List of Figures

1  Groups of families of abstraction in Epos.                              12
2  Families of abstractions concerning memory management in Epos.          13
3  Epos family of memory segments.                                         14
4  Epos family of address spaces.                                          15
5  Families of abstractions concerning process management in Epos.         17
6  Epos family of tasks.                                                   18
7  Epos family of threads.                                                 18
8  Epos family of processor scheduling policies.                           20
9  Families of abstractions concerning process coordination in Epos.       24
10 Epos family of synchronizers.                                           24
11 Families of abstractions concerning inter-process communication in Epos. 26
12 Epos family of communicators.                                           27
13 Epos family of communication channels.                                  29
14 The effect of configurable features buffering and grouping on communication channels. 30
15 Epos family of networks.                                                31
16 Epos family of message envelopes.                                       33
17 Epos family of timers.                                                  34
18 Families of abstractions concerning I/O management in Epos.             34
19 Epos family of buses.                                                   35
20 Epos family of device abstractions.                                     36
21 Epos family of interrupt handlers.                                      37
22 Access to external abstractions in Epos.                                38
23 Epos family of identification aspects.                                  43
24 Epos family of sharing aspects.                                         45
25 Epos family of allocation aspects.                                      47
26 Epos family of protection aspects.                                      48
27 Epos family of timing aspects.                                          49
28 Epos family of atomicity aspects.                                       51
29 Epos remote invocation scenario aspect.                                 52
30 A remote object invocation in the Remote scenario.                      53
31 The Remote scenario aspect adapter.                                     53
32 Epos family of debugging and profiling aspects.                         54
33 A top view of the Epos component framework metaprogram.                 59
34 Epos framework: the Handle element.                                     60
35 Epos framework: the Stub element.                                       60
36 Epos framework: Proxy and Agent elements.                               61
37 Epos framework: the Adapter element.                                    62
38 Epos framework: the Scenario element.                                   63
39 Epos hardware mediator Node.                                            68
40 Epos hardware mediator CPU.                                             69
41 An overview of Epos initialization.                                     71
42 Epos organizations: (a) embedded-in-the-application and (b) µ-kernel.   72
43 Epos automatic configuration tools.                                     75
44 The dining philosophers problem in Epos.                                76
45 The Snow cluster from the perspective of Epos.                          82
46 A Snow node from the perspective of Epos.                               82
47 The organization of Snow networks.                                      84
48 The Myrinet network interface card from the perspective of Epos.        84
49 The Epos Myrinet abstraction.                                           88
50 Activity diagram of the Epos Myrinet sender process.                    89
51 A message exchange with Myrinet.                                        91
52 One-way bandwidth achieved by the Myrinet Network abstraction on Snow.  92
53 One-way latency measured for the Myrinet Network abstraction on Snow.   93
54 Fragments of the Epos catalog of composition rules.                     96
55 Fragments of the Epos catalog of system interfaces.                     97


List of Tables
1 Epos abstractions (part I).                   40
2 Epos abstractions (part II).                  41
3 Epos scenario aspects.                        56



A Bit of History

The Epos system was born in 1997 at the Research Institute for Computer Architecture and Software Engineering (FIRST) of the German National Research Center for Information Technology (GMD) as a project to experiment with the concepts and mechanisms of application-oriented system design. Indeed, Epos and application-oriented system design cannot be dissociated, since both have been evolving together from the very beginning as the research substratum for this dissertation.

The acronym Epos stands for Embedded Parallel Operating System. It was coined considering the main research goal established for the early system: to embed the operating system in a parallel application. The strategy followed to achieve this goal consisted in modeling the corresponding domain entities as a set of reusable and adaptable components, and in developing a mechanism that allows parallel applications to easily specify constraints to guide the arrangement of these components in a functioning system. As the system evolved, it became clear that this strategy could be applied to a broader universe of applications. Concerning design and organization, Epos is inherently tied to dedicated computing and static configurability, but whether a platform is dedicated to an application temporarily (as in traditional parallel systems) or permanently (as in most embedded systems) does not play a significant role. Hence, Embedded Parallel Operating System can also be interpreted as a system that targets both embedded and parallel applications.

Epos owes much of its philosophy to the Peace system [SP94a], from which it inherited the notion that "generic" and "optimal" are adjectives that cannot be simultaneously applied to the same operating system, besides a rich perception of family-based design. The Epos implementation for the Intel ix86 architecture reuses some of the strategies adopted in Aboelha [FAPS96], a former research operating system developed by the author. However, the design of Aboelha did not promote reuse, so these strategies had to be completely remodeled for Epos.

The German word "epos" means "epic" in English. Although suggesting that Epos is an epic, about an academic operating system (a Portuguese sailor) that had to traverse a sea of difficulties (of standardized application program interfaces), passing by Taprobana (the common sense defined by Unix and Microsoft Windows), to reach India (modern software engineering), would be too pretentious, the metaphors about the epic "Os Lusíadas" [dC72] are left for the delight of readers (application programmers).




Fundamentals

Epos was designed following the guidelines of application-oriented system design described in [Frö01]. Indeed, Epos was created to experiment with application-oriented system design, which in turn has been evolving to contemplate design issues raised by Epos. Consequently, the design of Epos is intrinsically application-oriented.

The domain envisioned by Epos is that of high-performance dedicated computing, which comprises applications that, besides running with exclusivity on their respective platforms, require an efficient management of resources. It is important to notice that both adjectives, dedicated and high-performance, can be interpreted subjectively. A dedicated system does not need to be permanently dedicated; it can also be scheduled to serve diverse applications, each of which gains absolute control over the platform for a certain time. Similarly, a high-performance system does not need to be associated with billions of floating-point operations. As far as the operating system is concerned, a small embedded application that demands the vast majority of the resources available in the platform to accomplish its duties calls for a high-performance operating system just as a parallel application does.

Aiming at supporting applications in the high-performance dedicated computing domain, Epos was designed pursuing the following goals:

- Functionality: Epos shall deliver the functionality necessary to support the execution of high-performance dedicated applications.

- Customizability: Epos shall be highly customizable, so that system instances can be tailored to specific applications; whenever possible, system tailoring shall succeed automatically.

- Efficiency: Epos shall make resources available to applications with the lowest possible overhead, besides being itself thrifty.

The domain of high-performance dedicated computing aimed at by Epos is under constant evolution, steadily assimilating technological innovations. As a product of domain engineering, Epos had to consider the dynamics of the envisioned domain, setting up an open and continuous domain analysis

process that allows new entities to be included in the design as they are incorporated by the domain. Families of abstractions were thoroughly modeled, leaving room for upcoming members. Even the hypothesis of completely new families being added to the system has been taken into consideration.

The extreme scalability implied by these goals could only be achieved with a meticulous separation of concerns. As suggested by application-oriented system design, abstractions were modeled independently of each other, of execution scenario aspects, and of component frameworks. Consequently, Epos abstractions can be extensively reused in a variety of scenarios. Furthermore, the framework can be adjusted to accommodate forthcoming abstractions, or to build particular software architectures, without affecting existing abstractions. In this way, Epos could develop into an application-oriented operating system.

For the forthcoming discussion of the Epos design, however, it is important to bear in mind that the ultimate goal of Epos is the validation of application-oriented system design concepts and techniques. Therefore, some complex domain entities were specified only to the extent of corroborating application-oriented system design. Such entities would have to be further refined in order to sustain an actual implementation.



System Abstractions

The Epos families of abstractions result from the application-oriented decomposition of the high-performance dedicated computing domain. Many of the entities in this domain that concern the operating system are conventions defined by computer scientists and system developers. Therefore, Epos resorted to traditional operating system books (e.g. Tanenbaum [Tan92] and Silberschatz [SGP98]) as a reference for the operating system domain vocabulary. This decision helped to make Epos more user-friendly, since its abstractions have been named after the classic concepts they represent, giving many of them a self-explanatory character. Systems that assign ethereal names to abstractions impose extra difficulties on users, who first have to discover the names of the abstractions they want to use.

Yet, many operating system abstractions stem from physical devices. These abstractions are conventionally modeled considering the role physical devices play in the operating system. However, as an application-oriented operating system, Epos gives priority to the role they play in application programs. For example, the timer physical resource is delivered as an abstraction capable of generating alarm events for applications, though it is internally reused as the operating system's time-keeping device.

As proposed by application-oriented system design, Epos abstractions have been modeled independently of execution scenario aspects and specific system architectures. Consequently, Epos abstractions could reach a degree of reusability that allows, for instance, the same abstraction of the Thread family to be deployed in a single- or multitasking environment, as part of a µ-kernel or completely embedded in an application. The protection barrier that eventually separates applications from each other and from the operating system is not modeled in the context of abstractions, but as an architectural aspect of the Epos framework.
Furthermore, the separation of scenario aspects from abstractions considerably reduced the number of components in the Epos repository, since numerous scenario-specific specializations could be suppressed.

Figure 1 shows a top-level representation of the Epos families of abstractions. Families were organized in six large groups: memory management, process management, process coordination, inter-process communication, time management, and I/O management.

Figure 1: Groups of families of abstraction in Epos.

Each of these groups will be described in turn in this section, while the applicable scenario aspects will be described in the next section. An exception is made for private or specialized scenario aspects, which are described in the scope of the family (or abstraction) to which they pertain. Non-portable abstractions, such as node and processor, were modeled in Epos as hardware mediators; these abstractions are described separately in section 5.2.2.

The inflated interface of each family of abstractions will not be explicitly presented, since it would contribute little to the reasoning about the design of Epos in comparison to the explosion of details it would trigger. Nevertheless, the category of each family of abstractions (uniform, incremental, combined, or dissociated) will be identified, thus revealing how its inflated interface has been specified.


3.1 Memory Management

Memory management in Epos is accomplished by the families of abstractions shown in figure 2. The main memory available in the system is delivered to applications through the concept of a memory segment, realized by the Segment family. In order to be manipulated, a segment must first be attached to the address space of a process, which is realized by the Address Space family. Besides supporting this logical concept, the family of address spaces is the one that ultimately implements a memory management policy for the system, since it controls the allocation and mapping of the physical memory that effectively makes up a memory segment.
Figure 2: Families of abstractions concerning memory management in Epos.

In the context of application-oriented operating systems, virtual memory is not an autonomous domain entity, but a workaround to enable the execution of applications whose memory requirements exceed the available physical memory. That is, applications do not directly demand virtual memory; they demand memory. However, currently available virtual memory mechanisms are mostly inadequate to support applications in the domain of high-performance dedicated computing, firstly because many dedicated systems do not count on secondary storage to extend main memory, but also because such mechanisms do not sustain high performance. Therefore, no virtual memory strategy has been modeled for Epos at this time.

3.1.1 Memory Segments

A memory segment is nothing but a chunk of main memory that can be used by applications to store arbitrary code and data. The physical memory corresponding to a memory segment is managed by the family of address spaces, thus keeping the Segment family independent of any memory management policy.

Figure 3 shows the Epos family of memory segments, which was designed in an incremental fashion. The common aspects of the family are encapsulated in a package that is reused by member Static, which is in turn extended by member Dynamic. The former is responsible for memory segments that can neither be resized nor have their properties modified after having been attached, while the latter allows for these operations.

These abstractions could only be modeled as an incremental family because the effective allocation and mapping of physical memory to build a segment is carried out by the family of address spaces; otherwise, differences in allocation policy would have made this design impractical. Furthermore, the family was reduced to only two members because sharing and protection were modeled as scenario aspects. The Shared scenario aspect profits from the memory mapping capabilities of the Address Space family to support the sharing of segments among processes, while the Protected aspect uses the mapping modes provided by that family to protect segments against detrimental access. Both abstractions in this family are mutually exclusive, since their dependencies on the family of address spaces, itself mutually exclusive, cannot be accommodated simultaneously. In order to allow resizing, member Dynamic requires a non-contiguous mapping of physical memory, which is only provided by the Paged member of the Address Space family. Member Static has no requirements in this regard and is able to operate properly with any kind of address space.
Figure 3: Epos family of memory segments.


3.1.2 Address Spaces

The address space of a process corresponds to the range of memory addresses it can access. In order to support multitasking, most hardware architectures include a Memory Management Unit (MMU) that enables the separation of logical and physical address spaces. A process is thus said to have a logical address space, which is mapped into the physical memory address space.
The Shared and Protected scenario aspects modeled here are specializations of global scenario aspects described in section 4.


The MMU supports a non-contiguous mapping between these address spaces, enabling a rational use of memory by multiple processes. Nevertheless, many dedicated applications are carried out by a single process (on each node) and will not benefit from this mechanism. Moreover, some dedicated applications are executed on platforms that do not feature an MMU. Therefore, an address space abstraction for the dedicated computing domain must consider both perspectives: with and without logical address mapping.

The concept of address space is realized in Epos by the dissociated family of mutually exclusive abstractions shown in figure 4. These abstractions are not intended to be directly used by application programmers; they are implicitly invoked as a result of the memory and process management policies in force.

The Flat member is used when a single process (possibly multithreaded) executes alone in the system. It implements a policy that pre-allocates all the memory available in the system to this single process in a contiguous way, dispensing with an MMU and allowing for an efficient application-level memory allocator (malloc). Such an allocator still has to keep track of the stretches of memory currently in use, but no longer needs to perform physical memory allocation and mapping. The contiguous property of the Flat address space also brings an important advantage to processes that perform user-level I/O: since the logical and physical address spaces match, Direct Memory Access (DMA) transactions can be issued with logical addresses, thus eliminating translations and intermediate copies.






Figure 4: Epos family of address spaces.

When the policy defined by the Flat abstraction is in force, the family of memory segments becomes merely an informative connotation, since there is no practical reason for a process to create a memory segment if it already possesses the whole memory and there are no other processes with which to share that segment.

The Paged member of the Address Space family supports the mapping of logical address spaces to main memory through paging [KHPS61, HK61]. This strategy allows multiple processes to coexist in a rational way, since memory is allocated and mapped to their address spaces on demand and is reclaimed when they terminate. At least two other alternatives exist to handle memory management in multitasking scenarios: segmentation [Den65] and paged segmentation [Org72]. However, the hardware mechanisms necessary to support them are mostly unavailable in contemporary computers, so they were not specified in Epos.

A set of memory mapping modes (e.g. read-only, cached, etc.) has been modeled as configurable features for the Address Space family. These modes have defaults that apply to the whole address space, but can usually be dynamically overridden for each memory segment. A set of allocation algorithms has been modeled as a mutually exclusive configurable feature for the family.


3.2 Process Management

Process management in Epos is delegated to the families of abstractions shown in figure 5. The concept of process is delivered to applications by two abstractions: Task and Thread. If a process is thought of as a program in execution, then a task corresponds to the activities specified in the program, while threads are the entities that perform such activities. This separation of concerns enables a task to be simultaneously executed by multiple threads. Threads are locally scheduled for execution by a CPU Scheduler. Global schedulers to promote load balancing for parallel applications have not yet been specified for Epos; they shall inaugurate a new family of abstractions in the future.
The Intel ix86 architecture [Int95a] uses segmentation or segmentation plus paging, but it can be configured to emulate pure paging.





Figure 5: Families of abstractions concerning process management in Epos.

3.2.1 Tasks

A task comprises the code, global data, and resources of a process, thus being passively shared by its threads. A task constitutes the unit of distribution of a parallel application.

Tasks are realized in Epos by the Task family of abstractions depicted in figure 6, which was modeled as a dissociated family with two mutually exclusive members, namely Exclusive and Mutual. Abstractions in this family were modeled to be deployed independently of a particular thread abstraction, so both members can be used with a single thread or with multiple threads. Two instances of the Segment abstraction are allocated to hold, respectively, the code and the data associated with a Task.

The Exclusive member of the Task family was conceived to support a single process that has absolute control over the resources available in the platform. Hence, it pairs up with the Flat member of the Address Space family, which implements a single contiguous address space. An exclusive task is set up in such a way that, when the corresponding process begins execution, all needed physical resources, including memory, are already allocated to it. This is one of the factors that enable Epos to assume the embedded-in-the-application architecture, in which all system services are implemented at user level (see section 5.3.2 for more details about this architecture).

The Mutual member of the Task family pairs up with the Paged member of the Address Space family to accomplish a multitasking environment in which tasks share the available resources, which are allocated on demand.

The scenario aspects that concern this family (e.g. identification, location, protection, etc.) have been modeled as global aspects, for they also apply to other abstractions. They will be described in section 4.

Figure 6: Epos family of tasks.

3.2.2 Threads

Threads "execute a task": they correspond to the active part of a process. The resources of a task, including its data segment, are shared by all of its threads, but each thread has its own local data segment (the stack) and execution context. A thread constitutes the unit of execution of an application.

Threads are realized in Epos by the Thread family of abstractions depicted in figure 7. This family was modeled in an incremental fashion, with commonalities being captured in a package that is reused by member Exclusive and inherited by the other family members.
Figure 7: Epos family of threads.

The Exclusive member of the Thread family was conceived to support a single-threaded process that executes alone in the system. It depends on the Exclusive member of the Task family, which ultimately shapes the single-tasking environment in which exclusive threads perform. Though primitive, this process model is sufficient to support a reasonable number of dedicated applications, and it has expressive advantages over other models with regard to performance. Since all resources available in the system are pre-allocated to this unique process during system initialization, and also because no scheduling is necessary, this process model can be achieved without any run-time overhead, i.e., all physical resources, including CPU time, are entirely available to the application process.

Support for multithreading is the subsequent increment to the Thread family. It is realized by the Cooperative family member, which covers the mechanisms necessary for multiple threads to coexist in the system. This family member was conceived to be independent of the task model in force, so no differentiation is made between threads that execute the same task and threads of single-threaded tasks. As stated before, the unit of execution in Epos is the thread, not the task. The Cooperative thread abstraction addresses the issues relative to multiple threads that share the processor in a collaborative way, dispensing with processor scheduling. In this scenario, a thread voluntarily relinquishes the processor in favor of another one; hence, one could say that cooperative threads are self-scheduling.

The Concurrent abstraction extends the Thread family so that threads are executed concurrently. This family member relies on a family of processor schedulers, each implementing a different scheduling policy, in order to multiplex the processor among threads. According to the scheduling policy in force, the scheduler can be invoked regularly to verify whether the running thread still fulfills the necessary conditions to seize the processor, otherwise preempting it in favor of another thread.
Being an extension of cooperative threads, concurrent threads also have the possibility to voluntarily relinquish the processor, either in benefit of another thread or causing the scheduler to be invoked.

3.2.3 Processor Schedulers

Processor scheduling is deployed in order to multiplex the processor for concurrent thread execution. Therefore, Epos only features a processor scheduler if the Concurrent member of the Thread family is in use. In this case, the scheduler is realized by the entities shown in figure 8. The CPU Scheduler abstraction realizes the mechanisms that are necessary to suspend and resume the execution of threads (i.e. context switching), invoking Policy to select a thread to occupy the processor whenever it becomes idle. That is, Epos does not actually feature a family of schedulers, but a family of scheduling policies that are enforced by a single CPU Scheduler abstraction.

Only three operations are valid on objects of type Exclusive Thread: self-referencing, status reporting, and termination.
Figure 8: Epos family of processor scheduling policies.

The isolation of scheduling policies was achieved by modeling Policy as a polymorphic uniform family of abstractions. In this way, CPU Scheduler can invoke Policy without having to consider which policy is currently in force, and policies can be changed at run time. In order to configure the scheduler, one or more policies are selected at generation time to be deployed at run time. Since the processor scheduler is an internal entity, invisible to applications, the selection of policies at run time is performed through operations provided by the Concurrent thread abstraction.

If one of the scheduling policies selected for a given system configuration is preemptive, the preemption configurable feature is enabled, causing the scheduler to be periodically activated by the Alarm abstraction [section 3.5] to check for policy compliance. A thread that voluntarily relinquishes the processor also causes the scheduler to be activated. If the leaving thread indicates a candidate to replace it on the processor, that thread temporarily

assumes the scheduling characteristics of the leaving one. For example, if static priority is the policy in force, then the priority of the designated thread is raised to equal that of the designator until the next scheduling event. Otherwise, a thread can relinquish the processor leaving the decision of which thread shall substitute it to the policy enforcer. A third situation in which the scheduler can be activated occurs when a concurrent thread goes waiting for an event, for example I/O completion. The traditional approach of generic multitasking systems to deal with this situation is to block the waiting thread and invoke the scheduler to select another thread to occupy the processor. In this scheme, blocked threads are said to be “idle waiting”. Nevertheless, this solution is not so straightforward for dedicated systems, which may be better served if “busy waiting” is deployed. Several high-speed peripherals, including high-speed networks, may imply in busy waiting cycles that are shorter than the time needed to block the running thread and reschedule the processor. Therefore, busy and idle waiting have been modeled as con?gurable features that can be selected according to application needs. The CPU Scheduler abstraction is also able to perform processor scheduling in a Symmetric MultiProcessor (SMP) environment [JS80, TG89]. Regardless of the number of processors, a single queue of threads ready to execute is maintained. When a processor becomes idle, CPU Scheduler is invoked by that processor to select a new thread to execute. Race conditions originated from parallel invocations of the scheduler are prevented by activating one of the Locked members of the Atomic scenario aspect (see section 4.6)5 . 
In multiprocessor environments, the configurable feature affinity can be enabled to request CPU Scheduler to consider processor affinity [TTG93] while choosing a thread to occupy an idle processor, so that threads that were formerly executing on that processor are given preference. If the scheduler cannot find a thread under the affinity criterion, a thread formerly executed on another processor is selected. Processor affinity is useful with short-term scheduling to promote cache locality, since a thread returning to a processor it has recently seized may still find part of its working set of memory in that processor's cache.

5 The dependency of CPU Scheduler on Atomic is externally expressed as a composition rule [section 5.1.2]. It can easily be suppressed for scheduling algorithms that are able to cope with non-blocking synchronization.

The FCFS member of the Policy family realizes the first-come-first-served scheduling policy. When concurrent threads operate under this policy, they acquire several of the characteristics of cooperative threads, for a thread only releases the processor of its own free will. Nevertheless, if the idle waiting configurable feature is enabled, a thread going into a waiting state releases the processor to the next thread, retrieving it when it returns to the ready state.

The RR member of the Policy family realizes the round-robin processor scheduling policy. Under this policy, the processor is time-multiplexed among threads, so that each thread is given an identical fraction of processing time (time-slice), after which it is preempted and sent to the end of the ready queue.

The Static Priority member realizes a processor scheduling policy based on statically assigned priorities [ZRS87]. When a thread is created, it is assigned a priority that is used for scheduling decisions: higher-priority threads are executed first. Although applications are provided with means to redefine the priority of a thread afterwards, the scheduler itself never takes this initiative. This policy influences the handling of asynchronous events in Epos [section 3.6.3], since an event only causes the processor to be preempted if the thread assigned to handle it has a higher priority than the one currently being executed.

The Dynamic Priority member extends Static Priority and RR to accomplish a policy that automatically adjusts the priority of threads to promote interactiveness. The strategy used to build this policy consists in allowing a thread to seize the processor for a certain time (usually greater than the time-slice defined for round-robin), after which it is preempted and its priority is recalculated.
Priorities are recalculated as a function of the processing time effectively used, so threads that voluntarily relinquish the processor before their time-slice is over have their priority increased, while those that tend to monopolize the processor have their priority decreased. This is basically the scheduling policy adopted in the Unix System V operating system [Bac87]. It promotes interactiveness because interaction with users is achieved via I/O operations, causing interactive threads to release the processor prematurely and consequently raising their priority6.
6 In order to achieve this effect in Epos, the idle waiting configurable feature must be enabled.
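The recalculation described above can be illustrated with a small function. The formula below is an assumption for illustration only; it is not the one used by EPOS or System V, but it shows the shape of the rule: the larger the fraction of the quantum consumed, the lower the resulting priority.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical priority recalculation: 'used' and 'quantum' share a time
// unit; a higher return value means a higher priority. Threads that gave
// up the processor early (small 'used') gain priority; monopolizers lose it.
int recalculate_priority(int base, int used, int quantum) {
    int penalty = (10 * used) / quantum;      // 0..10 scale of CPU consumption
    return std::max(0, base + 5 - penalty);   // bonus for relinquishing early
}
```

For example, a thread with base priority 20 that used none of its quantum rises to 25, while one that consumed the whole quantum drops to 15.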


In principle, a single processor scheduling policy is in force at a time. However, the Multilevel member of the Policy family supports scheduling policies being combined by gathering threads into groups and applying distinct inter- and intra-group policies. For example, threads could be gathered into three groups scheduled with static priorities (i.e., the inter-group policy). The group with the highest priority could again deploy static priorities to support a set of real-time threads; the intermediate-priority group, consisting of interactive threads, could deploy round-robin; and the group with the lowest priority, comprising batch threads, could be subject to FCFS. Although such a complex scheduling scenario is improbable for a dedicated system, some simpler combinations of scheduling policies may be useful.

User Defined, the last member of the Policy family of abstractions, allows applications to specify their own processor scheduling policy. It is accomplished by a multilevel scheme that assigns the highest priority to an application thread that acts as the scheduler, while all other threads are assigned the same lower (than the scheduler thread) priority. The scheduler thread implements the desired policy and relinquishes the processor in benefit of the selected thread. A preemptive user-level scheduler [ABLL92] can be accomplished in a variety of ways, the simplest of which defines a round-robin policy for the high-priority group of threads, which comprises only the scheduler thread. This guarantees that the scheduler thread regains control over the processor on a regular basis. It can then decide to reschedule the thread that originally occupied the processor, or select another one. Nevertheless, the User Defined policy can considerably affect performance if used to perform short-term scheduling, since it doubles the scheduling overhead.


3.3 Process Coordination

The mechanisms available in Epos to coordinate the parallel execution of processes are realized by the Synchronizer and Communicator families of abstractions shown in figure 9. However, the Communicator family delivers coordination as a side-effect of inter-process communication, and hence will be described later, in section 3.4. Nevertheless, it is important to observe that every time two processes exchange a (possibly empty) message, they implicitly exchange status information that can be used for coordination purposes.

For example, when a process sends a message to another, it signals to the recipient process that it has finished the computation needed to produce that message. The set of possibilities to indirectly coordinate processes through message exchange grows considerably if inter-process communication implies a rendezvous between the communicating processes. In this case, sender and receiver are implicitly synchronized on each message exchange.
Figure 9: Families of abstractions concerning process coordination in Epos.



3.3.1 Synchronizers

Synchronizers are used to avoid race conditions during the execution of parallel programs. A race condition occurs when a thread accesses a piece of data that is being modified by another thread, obtaining an intermediate value and potentially corrupting that piece of data. A synchronizer prevents such race conditions by trapping sensitive data in a critical section, which is executed exclusively by a single thread at a time. The Epos dissociated family of synchronizers is depicted in figure 10.
Figure 10: Epos family of synchronizers.

The Mutex member of the Synchronizer family implements a simple mutual exclusion device that supplies two atomic operations: lock and unlock. Invoking lock on a mutex locks it, so subsequent invocations cause the calling threads to wait. When a thread invokes the operation unlock on a mutex and there are threads waiting on it, the first thread put to wait is allowed to continue execution, immediately locking the mutex. If no threads are waiting, the unlock operation has no effect; it is not accumulated to match forthcoming lock operations. The mutex mechanism is sometimes called a binary semaphore.

The Semaphore member of the Synchronizer family realizes a semaphore variable [Dij65]. A semaphore variable is an integer variable whose value can only be manipulated indirectly, through the atomic operations p and v. Operation p atomically decrements the value of a semaphore, and operation v atomically increments it. Invoking p on a semaphore whose value is less than or equal to zero causes the thread to wait until the value becomes positive again. A semaphore initialized with "1" acquires the semantics of a mutex, but it can be initialized with any other value to accomplish other synchronization semantics.

The Condition member of the Synchronizer family realizes a system abstraction inspired by the condition variable language concept [Hoa74], which allows a thread to wait for a predicate on shared data to become true. Condition protects the associated shared data in a critical section using the capabilities inherited from Mutex. In order to wait for the assertion of a predicate, a thread invokes operation wait, which implicitly unlocks the shared data and puts the thread to wait. Several threads can be waiting on the same condition. The assertion of a predicate can be announced in two ways: operation signal announces it to the first waiting thread, and operation broadcast announces it to all waiting threads.
When a thread returns from the wait operation, it implicitly regains control over the critical section.

The global scenario aspect remote, which will be described later in section 4.7, provides a remote object invocation mechanism that shapes a distributed scenario for abstractions. When used in this scenario, abstractions of the Synchronizer family accomplish a centralized solution to coordinate the processes of a parallel application. The solution consists in having one of the processes create the synchronizer locally, while the remaining processes share it via the remote object invocation mechanism. However, this mechanism was not designed with distributed coordination in focus and may become a bottleneck in the parallel application7. Therefore, a specialization of the remote scenario aspect has been considered for the Synchronizer family to optimize this scenario. A fully distributed solution, however, can only be achieved with the introduction of a new family member. A negative specialization (cancellation) was specified for the atomic global scenario aspect [section 4.6], since operations on synchronizers are inherently atomic.
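The p/v semantics described above can be modeled over standard primitives. EPOS itself implements semaphores directly over atomic instructions; the class below is only a behavioral sketch of the semantics, with a count() accessor added for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Behavioral model of the Semaphore member: p atomically decrements the
// value, waiting while it is not positive; v atomically increments it and
// wakes one waiter. Initializing with 1 yields mutex-like semantics.
class Semaphore {
    int value;
    std::mutex m;
    std::condition_variable cv;
public:
    explicit Semaphore(int v = 1) : value(v) {}
    void p() {                        // wait until positive, then decrement
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return value > 0; });
        --value;
    }
    void v() {                        // increment and wake one waiting thread
        std::lock_guard<std::mutex> lk(m);
        ++value;
        cv.notify_one();
    }
    int count() {                     // illustrative accessor, not part of the text
        std::lock_guard<std::mutex> lk(m);
        return value;
    }
};
```

A semaphore created with a value of 2, for instance, admits two threads into the critical region before a third p() would block.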


3.4 Inter-Process Communication

Inter-process communication in Epos is delegated to the families of abstractions shown in figure 11. Application processes communicate with each other using a Communicator, which acts as an interface to a communication Channel implemented over a Network. The messages sent through a Communicator can be specified as sequences of bytes of a known length, or they can be covered by an Envelope.
Figure 11: Families of abstractions concerning inter-process communication in Epos.



3.4.1 Communicators

A communicator is an end-point for a communication channel that enables application processes to exchange data with each other. Therefore, when an application selects a communicator, it implicitly designates the kind of communication channel that will be used. Communicators, like most other system abstractions, are assigned to tasks, thus being shared by their threads. Communicators are realized in Epos by the Communicator family of abstractions shown in figure 12, which was modeled as a dissociated family whose members can be simultaneously deployed.

7 The node in which the synchronizer resides becomes a kind of "coordinator", with which all other processes have to communicate.
Figure 12: Epos family of communicators.

The Link member of the Communicator family realizes an end-point for logical connections between processes that carry byte streams. The Port and Mailbox members realize end-points for communication channels in which datagrams flow; a port always belongs to a single task, while mailboxes can be shared among tasks.

The ARM Segment (Asynchronous Remote Memory Segment) member of the Communicator family realizes an end-point for a communication mechanism that supports asynchronous access to a memory segment on a remote node. This mechanism is asynchronous because processes manipulating an ARM Segment are not implicitly synchronized and can corrupt the data in that segment. Data read from a remote segment becomes local and private to the reading process. If necessary, synchronization has to be achieved by other means (e.g., distributed semaphores). In order to use this communicator, a process specifies a memory segment on a remote node that has been previously exported by its owner. It can then invoke operations to read from and write to this segment (asynchronous remote memory segments are not mapped into the address space of processes).

The AM Handler (Active Message Handler) member of the Communicator family realizes an end-point for active messages [vECGS92]. The basic idea behind this concept is that a message, besides transporting data, also carries a reference to a handler that is invoked, in the context of the receiving process, to handle the message upon arrival. This kind of communication is sometimes called single-sided because the receive operation is not explicitly expressed. For this mechanism to work properly, means must be provided for the sending process to specify a handler that is valid in the context of the destination process. The most typical answer to this issue is to deploy active messages in an SPMD (Single Program, Multiple Data) environment, in which all processes have an equivalent address space. However, indirection mechanisms and the exchange of handler references using other communication mechanisms are also possible. Active messages have been modeled in Epos in such a way that communication is hidden behind a remote handler invocation, with messages being indirectly exchanged as arguments to the handler. When a process instantiates an AM Handler, it supplies a reference to a handler in a remote process. Afterwards, it can invoke the handler supplying arguments that are transparently marshaled into a message and delivered to the remote handler.

The DSM Segment member of the Communicator family, which would realize a Distributed Shared Memory (DSM) mechanism for Epos, could have been modeled as an extension of the ARM Segment communicator and of a member of the Segment family of memory segments. Unlike an ARM Segment, however, a DSM Segment would be attached to the address space of processes, dispensing with explicit read and write operations. It would also implement a mechanism to grant data coherence. This communicator would enable application programmers to write parallel applications for distributed-memory machines as if they were shared-memory ones.

A negative specialization (cancellation) was specified for the remote global scenario aspect [section 4.7], since a process is not allowed to exchange messages using communicators created by other processes on remote nodes.
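The active-message idea, a message carrying the handler that will consume it, can be sketched in a few lines. This is a single-node behavioral model, not EPOS's API: the types ActiveMessage and deliver are illustrative assumptions, and the "remote" handler is just a local callable standing in for a handler valid in the receiver's context.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Sketch of an active message: besides the payload, the message carries a
// reference to the handler to be invoked, in the receiver's context, upon
// arrival. No explicit receive operation exists (single-sided communication).
struct ActiveMessage {
    std::function<void(const std::string&)> handler; // must be valid at the receiver
    std::string payload;
};

// The receiving side's dispatch reduces to invoking the carried handler
// with the message contents as its argument.
inline void deliver(const ActiveMessage& m) { m.handler(m.payload); }
```

In the real mechanism the handler reference must be meaningful on the destination node, which is why the text points to SPMD layouts or indirection tables for resolving it.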
3.4.2 Channels

A communication channel is the entity effectively responsible for inter-process communication in Epos. It uses network resources to build a logical communication channel through which messages are exchanged. A channel implements a communication protocol that, according to the Basic Reference Model for Open Systems Interconnection (ISO/OSI-RM) [ISO81], would be classified at level four (transport). The Epos family of communication channels is depicted in figure 13. It was modeled as a dissociated family whose members are indirectly accessed through the corresponding members of the Communicator family.

Figure 13: Epos family of communication channels.

A communication channel has an implicit capacity [Sha48]. Trying to insert a message into a saturated channel causes the transmitter to wait until the channel can accommodate the message. Likewise, the attempt to extract a message from an empty channel causes the receiver to wait until a message is available. Whether a thread waiting on a channel performs busy or idle waiting hinges on the related configurable feature of the CPU Scheduler family [section 3.2.3]. Notwithstanding this, a channel can have its capacity extended by enabling the buffering configurable feature. In this case, messages sent through a saturated channel are accumulated for posterior handling.

Sometimes it is desirable to fork a channel, so that a transmitted message is simultaneously multicast to several receivers, or broadcast to all receivers. The collective operations used in many parallel applications could be considerably optimized in this way. Epos allows a channel to be forked when the grouping configurable feature is enabled. In this case, special identifiers are supplied to designate a group of communicators as the recipient of a message. The effect of the buffering and grouping configurable features on a channel is illustrated in figure 14.

Figure 14: The effect of configurable features buffering and grouping on communication channels.

The synchronous scenario aspect yields an execution scenario in which the operations used to inject a message into a channel only conclude when the message is extracted from the channel at the receiver's side. In a synchronous communication scenario, processes involved in a message exchange are said to make a "rendezvous". Conversely, the asynchronous scenario aspect modifies these operations so they conclude as soon as the delivery of a message is negotiated. If the buffering configurable feature is enabled, this is achieved by copying the message into a buffer and scheduling it for delivery. Otherwise, the sender is supposed not to modify the message until indicated to do so. An operation is provided that enables a process to check for this condition.

The Stream member of the Channel family realizes a connection-oriented channel that can be used to transfer streams of bytes. It pairs up with the Link communicator. The Datagram member realizes a channel that supports the transmission of datagrams. It has two possible end-points: Port and Mailbox. Three members concern asynchronous access to a remote memory segment: ARMR, ARMW, and ARMC. They realize communication channels respectively for reading, writing, and copying (reading and writing) from/to a remote memory segment and are delivered to applications through the ARM Segment communicator. The AM (Active Message) channel specializes ARMW to introduce the concept of a message handler that is automatically invoked when the message reaches its destination. It pairs up with the AM Handler communicator. The communication channel used to support distributed shared memory would specialize the ARMC channel in order to map it to the address space of processes.

3.4.3 Networks

A communication channel is ultimately an abstraction of a network, in that networks provide the physical means to build logical channels. The idiosyncrasies of each network technology, however, could require the members of the Channel family to be specialized too often. This was prevented in Epos by modeling networks as members of a uniform family of abstractions, so that all networks are equivalent from the standpoint of channels. The uniform design of the Network family, which is outlined in figure 15, does not suppress special features delivered by a particular network, since abstractions in this family implement high-level transport services that are seldom implemented by the hardware. The virtual networks in this family can use special services provided by the network to optimize the implementation of such transport services. Some of these special features are used to implement the configurable features modeled for the family.
Figure 15: Epos family of networks (configurable features: ordering, flow_control, reliability, broadcast, multicast).


The Network family features a member for each network technology supported in the system (e.g., Fast Ethernet and Myrinet). Each member encapsulates a physical network device that has been previously abstracted by the Device family [section 3.6.2]. The family also features a Loop device that is used to establish a communication channel between processes executing on the same node. In principle, abstractions in this family are used indirectly through a communicator, but they are also made available for the convenience of applications that need, for instance, to implement special communication protocols.

A set of configurable features, corresponding to operational modes, was modeled for the Network family. These features are interpreted as follows: ordering requires messages sent through a network to be delivered at the destination in the same order they were sent; flow_control requires a network abstraction to implement flow control; reliability requires a network to assure error-free delivery of messages; broadcast enables the interpretation of broadcast addresses, so messages can be broadcast to all hosts in the local network; multicast enables the interpretation of multicast addresses, causing a message to be delivered to multiple hosts. These configurable features are usually specialized for each family member to profit from eventual hardware support.

3.4.4 Message Envelopes

The members of the Communicator family can be used to exchange unstructured messages in the form of sequences of bytes of a certain length. However, it might be adequate for some applications to count on an envelope abstraction to cover a message before it is sent. Such an envelope would be allocated from the operating system, loaded with one or more messages, and then inserted into a communication channel through a communicator. An envelope allocated by the operating system would enable several optimizations, ranging from cache alignment to zero-copy processing. Besides, additional information can be put in the envelope to describe and protect messages. Finally, an envelope enables a comfortable syntax to express communication in object-oriented applications, for example:

    Envelope envelope(recipient, length);
    envelope << "Hello world!";
    communicator << envelope;

Epos supports the concept of message envelope through the Envelope uniform family of abstractions represented in figure 16. The maximum length of message that an envelope can hold is specified when it is instantiated, while the effective length of the message(s) it contains is dynamically determined. An envelope must be addressed before it is posted.



Figure 16: Epos family of message envelopes.

The Envelope family comprises two members: Untyped and Typed. The former realizes a simple message envelope that can be used to gather messages before sending, while the latter collects type information for each message inserted, to enable format conversions on heterogeneous systems. A secure envelope was not modeled due to the characteristics of dedicated computing systems, which usually require neither encryption nor authentication of messages8.
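An Untyped envelope matching the syntax shown earlier can be sketched as follows. This is an illustrative model, not EPOS's implementation; the field names and the policy of refusing messages that would overflow the maximum length are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch of the Untyped envelope: the maximum length is fixed at
// instantiation, the effective length of the gathered message(s) is
// determined dynamically, and the envelope carries its own addressing.
class Envelope {
    std::string recipient;            // the envelope must be addressed before posting
    std::size_t max_length;           // fixed when the envelope is instantiated
    std::string body;                 // message(s) gathered so far
public:
    Envelope(const std::string& to, std::size_t max)
        : recipient(to), max_length(max) {}
    Envelope& operator<<(const std::string& msg) {
        if (body.size() + msg.size() <= max_length)   // refuse overflowing messages
            body += msg;
        return *this;
    }
    std::size_t length() const { return body.size(); } // effective length, dynamic
    const std::string& to() const { return recipient; }
};
```

Several messages can be inserted before the envelope is handed to a communicator, which is the "loaded with one or more messages" behavior described above.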


3.5 Time Management

Time is managed in Epos by the dissociated Timer family of abstractions shown in figure 17. The Clock abstraction is responsible for keeping track of the current time. It is only available on systems that feature a real-time clock device. The Alarm abstraction can be used to put a thread to "sleep" for a certain time. It can also be used to generate timed events. For this purpose, an application instantiates an abstraction from the Interrupt Handler family (see section 3.6.3) and registers it with Alarm, specifying a time interval and the number of times the handler object is to be invoked9.
8 A discussion about protection in dedicated systems is presented in section 4.4.
9 This mechanism is also used to activate the process scheduler.
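The register-a-handler-with-a-period-and-a-count mechanism can be modeled in discrete time. The real Alarm is driven by a hardware timer and measures intervals in seconds; below, time advances in explicit "ticks" so the behavior stays testable, and the names schedule and tick are illustrative assumptions rather than EPOS's interface.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Discrete-time sketch of Alarm: a handler is registered together with a
// period (in ticks) and the number of times it is to be invoked; as time
// advances, due handlers fire until their repetition count is exhausted.
class Alarm {
    struct Entry {
        std::function<void()> handler;
        int period;                   // ticks between invocations
        int left;                     // invocations still to be delivered
        int countdown;                // ticks until the next invocation
    };
    std::vector<Entry> entries;
public:
    void schedule(std::function<void()> h, int period, int times) {
        entries.push_back({h, period, times, period});
    }
    void tick() {                     // one clock tick has elapsed
        for (auto& e : entries)
            if (e.left > 0 && --e.countdown == 0) {
                e.handler();
                --e.left;
                e.countdown = e.period;
            }
    }
};
```

A handler scheduled with period 2 and count 3 fires on ticks 2, 4, and 6, then never again, mirroring the interval-and-repetition contract described in the text.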






Figure 17: Epos family of timers.

The Timer family is completed by the Chronometer abstraction, which is used to perform time measurements. The unit of reference for these abstractions is the second, with times and delays represented as real numbers. The precision of these abstractions, however, depends on the hardware platform on which they are implemented.
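A Chronometer-like abstraction, reporting seconds as a real number as described above, can be sketched over std::chrono. EPOS reads a hardware time source instead, and the start/stop/read/reset interface below is an assumption for illustration.

```cpp
#include <cassert>
#include <chrono>

// Sketch of the Chronometer abstraction: accumulates elapsed time across
// start/stop pairs and reports it in seconds as a double (cf. the text's
// "times and delays represented as real numbers").
class Chronometer {
    std::chrono::steady_clock::time_point begin;
    std::chrono::steady_clock::duration elapsed{};
    bool running = false;
public:
    void start() {
        begin = std::chrono::steady_clock::now();
        running = true;
    }
    void stop() {
        if (running) {
            elapsed += std::chrono::steady_clock::now() - begin;
            running = false;
        }
    }
    double read() const {             // accumulated time, in seconds
        return std::chrono::duration<double>(elapsed).count();
    }
    void reset() {
        elapsed = {};
        running = false;
    }
};
```

The achievable precision is that of the underlying clock, which is the hardware dependence the text points out.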


3.6 I/O Management

The interaction between applications and peripheral devices is managed in Epos by the families of abstractions represented in figure 18. As a rule, peripheral devices are abstracted, in a way that is convenient to applications, by the members of the Device family. However, dedicated systems often deploy dedicated devices that will not be found in this family. Therefore, Epos also provides applications with means to directly interact with a peripheral device. In this context, the Bus family is responsible for detecting and activating devices connected to a bus, which are abstracted as dynamically created members of the Device family. The Interrupt Handler family of abstractions allows applications to handle interrupts generated by a device.




Figure 18: Families of abstractions concerning I/O management in Epos.




3.6.1 Buses

The Epos family of I/O bus abstractions is responsible for detecting and activating physical devices connected to a bus. The Bus family sketched in figure 19 was modeled as a dissociated family, with a member for each supported bus technology (e.g., ISA, PCI, SCSI). When a member of the Bus family is invoked to activate a device, it arranges for the application process to have total control over the device, mapping eventual memory and I/O ports into its address space. According to the bus technology, operations are provided to further configure devices and sometimes the bus itself.


Figure 19: Epos family of buses.



3.6.2 Devices

From a historical perspective, device drivers are some of the most important pieces of an operating system. From the beginning, operating systems have had the duty of hiding the peculiarities of each device from application programs, easing programming and allowing applications to survive steady upgrades. As an application-oriented system, Epos extends the notion of a device driver as a set of I/O routines, modeling devices as abstractions organized in a family. The makeup of this family is depicted in figure 20. Differently from the families of abstractions presented until now, the commonalities of the Device family are not collected in a single package, but split according to bus technologies (bus_1 dev through bus_n dev in the diagram). Each bus technology defines a dissociated subfamily of devices.

Figure 20: Epos family of device abstractions.

The decision for a dissociated family of devices defies the uniform organization consolidated by Unix, in which all devices adhere to a common pseudo-file interface [Tho78]. This interface has an "escape" operation, namely ioctl, that is used to pack operations that cannot be represented under the file interface10. Epos gives priority to the preservation of the individual characteristics of each device, allowing devices to define their own operations and eliminating the "hideous" ioctl operation. Nevertheless, the lack of a common interface makes it more difficult to add a device at run-time. Indeed, when an application invokes a member of the Bus family to detect and activate a device, the returned device abstraction is supposed to be handled at user level in the scope of the calling process. All but bus-specific operations for this device have to be implemented by the application itself. When Epos is configured as a kernel, neither the kernel nor other processes know about such a device. If sharing is required, a device "server" process has to be devised.

3.6.3 Interrupt Handlers

Epos allows application processes to handle hardware-generated interrupts at user level via the Interrupt Handler family of abstractions depicted in figure 21. The three members of this dissociated family can be used simultaneously. The Function IH member assigns an ordinary function supplied by the application to handle an interrupt. The system transforms such a function into an interrupt handler that is invoked every time the associated interrupt is generated. In contrast, the Thread IH member assigns a thread to handle an interrupt. Such a thread must have been previously created by the application in the suspended state. It is then resumed at every occurrence of the corresponding interrupt. After handling the interrupt, the thread must return to the suspended state by invoking the respective operation.

10 The Linux ioctl_list man page gives an incomplete list of the ioctl operations valid on kernel 1.3.27; it comprises 412 operations.






Figure 21: Epos family of interrupt handlers.

Nevertheless, these two members of the Interrupt Handler family may present problems if an interrupt is successively generated while the associated handler is active. The Function IH handler passes the issue on to the application, which must either guarantee that the handler is fast enough, or implement a reentrant function. The Thread IH handler is even more restricted in this regard, since resuming an already active thread has no effect. The issue of interrupt loss is addressed by the Semaphore IH, which assigns a semaphore, previously created by the application and initialized with zero, to an interrupt. The operating system invokes operation v on this semaphore at every interrupt, while the handling thread invokes operation p to wait for an interrupt. This strategy gives interrupt handling a producer/consumer flavor and prevents interrupts from being lost.

Nevertheless, preventing interrupts from being lost may not be enough to guarantee that I/O data will not be lost. A device that generates a subsequent interrupt while a former one is still being handled must have memory to accumulate eventual data. If this is not the case, a handler must be conceived that is able to match the device's operational speed. Moreover, depending on the scheduling policy in force, interrupts may not be handled immediately. For example, if the thread responsible for handling an interrupt has a lower priority than the thread currently being executed, the handler will be enqueued for later execution.
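The Semaphore IH scheme can be sketched with the semaphore model used earlier for the Synchronizer family: each "interrupt" performs v(), and the handling thread loops on p(), so interrupts arriving while the handler is busy are counted rather than lost. Interrupts are simulated here by plain calls to v(); the function handle_interrupts is an illustrative harness, not EPOS code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Behavioral semaphore (same model as in section 3.3.1), initialized with
// zero for interrupt handling: v() on each interrupt, p() to wait for one.
class Semaphore {
    int value;
    std::mutex m;
    std::condition_variable cv;
public:
    explicit Semaphore(int v = 0) : value(v) {}
    void p() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return value > 0; });
        --value;
    }
    void v() {
        std::lock_guard<std::mutex> lk(m);
        ++value;
        cv.notify_one();
    }
};

// Simulate a burst of interrupts arriving before the handler runs: every
// occurrence is recorded in the semaphore, so the handling thread still
// consumes all of them (producer/consumer flavor, no interrupt is lost).
int handle_interrupts(int bursts) {
    Semaphore irq(0);
    for (int i = 0; i < bursts; ++i) irq.v();   // "device" raises interrupts
    int handled = 0;
    std::thread handler([&] {                   // handling thread
        for (int i = 0; i < bursts; ++i) { irq.p(); ++handled; }
    });
    handler.join();
    return handled;
}
```

Contrast this with Thread IH: resuming an already running thread has no effect, so a burst of five interrupts could collapse into fewer handler activations, whereas the semaphore's counter preserves all five.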



3.7 External Abstractions

Some of the system services offered by Epos are not implemented on the dedicated computing system itself, but on an external server; file services and graphic output, for example. In order to enable these services to be remotely accessed, the Epos communication system is emulated over an ordinary operating system. A computer running this emulator becomes a gateway to an Epos-based environment that allows external abstractions to be accessed via remote object invocation (see section 4.7). This scheme is illustrated in figure 22 (Epos is represented in the figure in its most typical architecture, i.e., fully embedded in the application).




Figure 22: Access to external abstractions in Epos.

The set of "stub" functions used to perform remote invocation can also be used to give external services an Epos flavor. For instance, Posix file operations can be delivered through a file abstraction.



Summary of EPOS Abstractions

A summary of Epos abstractions is presented in tables 1 and 2. Families of abstractions are grouped by category. A brief description of responsibilities and dependencies is given for each family and for each member of a family.



Family / Member     Responsibilities                             Dependencies

Memory Management
  Segment           logical memory segments                      -
    Static          not resizable, flags not changeable          -
    Dynamic         resizable, remappable                        Address Space Paged
  Address Space     logical address space of a process           -
    Flat            single, contiguously mapped address space    -
    Paged           multiple, paged address spaces               -

Process Management
  Task              code, data, and resources of a process       Segment
    Exclusive       single-tasking                               Flat
    Mutual          multitasking                                 Paged
  Thread            stack and context of a process               Segment
    Exclusive       single-threading                             Exclusive Task
    Cooperative     cooperative multithreading (no scheduler)    -
    Concurrent      concurrent, scheduled multithreading         CPU Scheduler
  CPU Scheduler     processor scheduling policies                Alarm
    FCFS            first-come-first-served                      -
    RR              round-robin                                  -
    Static Prio     static priorities                            -
    Dynamic Prio    dynamic priorities                           -
    MultiLevel      multiple policies                            RR, Static Prio
    UserDef         user defined policy                          -

Process Coordination
  Synchronizer      process coordination                         -
    Mutex           mutual exclusion device                      -
    Semaphore       semaphore variable                           -
    Condition       condition variable                           -

Table 1: Epos abstractions (part I).



Family / Member      Responsibilities                              Dependencies

Inter-Process Communication
  Communicator       communication end-points                      Channel
    Link             connection for streams                        Stream
    Port             non-sharable mailbox for datagrams            Datagram
    Mailbox          sharable mailbox for datagrams                Datagram
    ARM Segment      asynchronous remote memory segment            ARMx
    AM Handler       active message handler                        AM
  Channel            communication channels                        Network
    Stream           streams                                       -
    Datagram         datagrams                                     -
    ARMR             asynchronous remote memory read               -
    ARMW             asynchronous remote memory write              -
    ARMC             asynchronous remote memory copy               -
    AM               active messages                               -
  Network            logical networks (OSI level 4)                Device
    Several          not covered in this table                     -
  Envelope           message envelopes                             -
    Untyped          envelope for untyped messages                 -
    Typed            envelope for typed messages (heterogeneity)   -

Time Management
  Timer              time keeping                                  -
    Clock            current time keeping                          -
    Alarm            timed event generation                        -
    Chronometer      time measurement                              -

I/O Management
  Bus                device detection and activation               -
    Several          not covered in this table                     -
  Device             physical device abstraction                   -
    Several          not covered in this table                     -
  Interrupt Handler  interrupt handling                            -
    FunctionIH       calls a handling function                     -
    ThreadIH         resumes a handling thread                     Thread
    SemaphoreIH      invokes v on a handling semaphore             Semaphore

Table 2: Epos abstractions (part II).


Scenario Aspects

Application-oriented system design is particularly concerned with scenario independence. When a domain entity is identified, considerations about its origin are made in order to decide whether it will shape an abstraction or a scenario aspect. Epos scenario aspects were modeled in accordance with this principle, yielding reusable pieces of software that can be controllably applied to the system abstractions described in the previous section.

Like system abstractions, Epos scenario aspects were organized in families according to what they have in common. The commonalities of each family are captured in a common package that is reused by member aspects. Accordingly, families of aspects were classified as uniform, incremental, dissociated, or combined. In order to build a scenario for an application, selected scenario aspects are merged in a scenario construct and applied to abstractions by means of a scenario adapter. These two constructs, however, will be described in the scope of the Epos component framework later in section 5.1.


Identification
“An object has state, behavior, and identity.” (Grady Booch [Boo94])

This axiom of object orientation, which evidently also applies to the instances of system abstractions11, implies that an object has an identity that distinguishes it from all the others. At the programming language level, this identity is usually represented by the object's address in memory (a pointer). This form of identity is adequate if objects are only manipulated in the scope of the process that created them, but it is inadequate to identify objects that are shared by multiple processes. In this case, the operating system has the duty of generating identifiers for system objects that are able to distinguish them across all processes. If objects can be shared in a distributed
11 Instances of system abstractions will be designated "system objects" hereafter, or simply "objects" when no confusion can arise. In this sense, an abstraction corresponds to a "class of system objects".


environment, identifiers must be extended to accomplish unique identification across all nodes in the platform. This variability concerning the identity property is typical of scenario aspects, for it affects the manifestation of an abstraction in a scenario, but not its internal structure. Therefore, the identity of system objects was modeled in Epos as an incremental family of scenario aspects. This family is represented in figure 23.

Member Pointer, the family's basic aspect, simply reuses the fundamental identity of objects, i.e. their address in memory. Since this form of identifier is always relative to the address space of the process that created the object, it cannot be used for inter-process interactions on objects. The Local member extends the Id family with an identifier that consists of a pair (type, unit) and uniquely identifies a system object in a stand-alone system configuration. The type field is obtained from the traits of an abstraction [section 5.1], while unit represents the rank of an object in the allocation table of its class. Hence, the first instance of a Myrinet network abstraction would be identified as (Traits<Myrinet>::type, 0).















Figure 23: Epos family of identification aspects.

Member Global extends the Id family so that a network-wide identifier is assigned to system objects, allowing them to be remotely accessed. The structural extension carried out by this member consists of adding a logical host number to identifiers, yielding the tuple (host, type, unit). Although

embedding location in identifiers complicates the migration of objects in a distributed environment, Epos opted for this solution because it is efficient and fulfills the demands of most parallel systems. Unlike distributed systems, dedicated parallel systems seldom deploy object migration, due to the high overhead involved. Other load balancing techniques that do not conflict with the Global identifier, such as data set partitioning and global scheduling, are often preferred [SP94b]. Nevertheless, the logical ids realized by this family could be locally remapped to reflect the location of migrated objects.

Local and Global identifiers have a lifetime that corresponds to the lifetime of the system. That is, if a process creates a system object and later deletes it, a newly created object may be assigned the same identifier formerly assigned to that object. This, however, should not disturb applications, since Epos keeps a reference counter that prevents objects in a shared scenario (i.e. a scenario for which the Shared aspect [section 4.2] has been enabled) from being deleted while there are still processes using them. Nonetheless, flawed programming may cause a process to keep on using an identifier after the respective system object has been destroyed, inadvertently accessing another object. If these incidents are to be handled by the operating system, the Capability identifier must be used, since its lifetime is restricted to the lifetime of the object it identifies.

The Capability identifier extends the Id family by adding a randomly generated number to the Global identifier. System objects are thus identified by the tuple (host, type, unit, rand). The rand field makes identifiers sparse, reducing the probability of an identifier being reused to a negligible level [MT86]. The exact length of this field, however, depends on the quality of the random number generator used and on the dynamics and lifetime of applications. Hence, it was modeled as a configurable feature.
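As an illustration, the (host, type, unit, rand) tuple could be represented as below. This is a sketch with assumed field widths; EPOS makes the rand length a configurable feature, so the 64-bit choice here is arbitrary.

```cpp
#include <cstdint>
#include <random>

// Sketch of a capability identifier: the Global (host, type, unit) part
// locates the object; rand makes the identifier sparse and hard to forge.
struct Capability {
    std::uint16_t host;   // logical host number
    std::uint16_t type;   // from the abstraction's traits
    std::uint16_t unit;   // rank in the class's allocation table
    std::uint64_t rand;   // sparse random field (width is configurable)
};

inline Capability make_capability(std::uint16_t host,
                                  std::uint16_t type,
                                  std::uint16_t unit) {
    static std::mt19937_64 generator(std::random_device{}());
    return Capability{host, type, unit, generator()};
}
```

Two objects that would collide under Global identification alone (same host, type, and unit over the system's lifetime) receive different rand fields, so a stale identifier will almost certainly fail to match a recycled object.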



Sharing
There are many reasons that lead processes to share resources. Synchronizers are shared to accomplish coordination, memory segments are shared as an intra-node communication mechanism, mailboxes are shared to support server replication, and so forth. When system resources are shared, the operating system has to provide means to preserve their integrity. For example, if a

shared resource has its state modified by one of the sharing processes, the remaining processes must be assured a coherent view of it. In particular, when a shared resource is destroyed by one of the sharing processes, subsequent operations invoked by the remaining processes have to be handled accordingly.




Figure 24: Epos family of sharing aspects.

Epos allows abstractions to be shared through the Shared incremental family of scenario aspects depicted in figure 24. When this aspect is enabled, abstractions are tagged with a reference counter by the Referenced family member, so that a shared abstraction is actually only destroyed when the last sharing process releases it. This scenario aspect is specialized for the family of memory segments [section 3.1.1] to provide adequate support for memory sharing. The Enrolled member of the Shared family of aspects extends Referenced to sustain multitasking configurations of Epos in which resources need to be reclaimed. In this scenario, Epos takes note of which tasks are sharing each resource, so that resources not returned by an exiting process can be reclaimed.

When the Shared aspect is enabled, system abstractions gain two extra constructors12 that are used to designate sharing. The first is the ordinary copy constructor, which takes a reference to an existing system object as argument and returns a share of that object. This constructor is restricted to sharing abstractions inside the same address space. The second constructor takes an identifier as argument and therefore can be used independently of
12 The term constructor is used here independently of the C++ programming language to designate a function used to instantiate an object. Applications written in programming languages that do not explicitly support the concept would be supplied operations with equivalent semantics.


locality, even to share remote objects. A C++ program could deploy these constructors as follows:

  Abstraction instance;              // creates a new system object
  Abstraction share1(instance);      // copy constructor: shares within the same address space
  Abstraction share2(instance.id()); // identifier-based constructor: independent of locality
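The reference counting performed by the Referenced member could be sketched like this (an illustrative fragment; the names are assumptions, not EPOS's actual interface):

```cpp
#include <cstddef>

// Sketch of the Referenced sharing aspect: each share attaches to the
// counter, and the object is destroyed only when the last share detaches.
class Referenced {
    std::size_t _refs;
public:
    Referenced() : _refs(1) {}                 // creation is the first reference
    void attach() { ++_refs; }                 // a new share was taken
    bool detach() { return --_refs == 0; }     // true => last share released, destroy
    std::size_t references() const { return _refs; }
};
```

A shared abstraction's destructor would call detach() and only release the underlying resources when it returns true, which is why deleting an object still in use by other processes is harmless.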



Allocation
In all-purpose operating systems, the memory internally used to store the state of system-level abstractions is expected to be dynamically allocated, since it is impossible to predict which abstractions, and how many instances of each, will be effectively required by forthcoming applications. Some dedicated computing systems, however, have a predictable demand for resources. In these cases, pre-allocating resources may significantly enhance performance. For example, if a process is known to spawn n threads, pre-allocating these threads (partially initializing the associated control structures and allocating the corresponding stacks) may eliminate a considerable fraction of the thread creation overhead.

A more imperative reason to avoid dynamic memory allocation can be observed for abstractions that have (part of) their state stored outside main memory. Several abstractions of physical devices fall in this category. If the device includes memory that is shared with the main processor, or if it memory-maps control registers, then the corresponding system abstraction will have part of its state mapped over the I/O bus. A dynamic memory allocator unaware of such particularities would allocate incoherent instances of the abstraction; making the memory allocator aware of them would compromise portability, for device mapping is an operation that depends on the I/O bus. Situations like this could be handled by a utility that executes prior to the operating system to pre-allocate the indicated abstractions, delivering them later to the operating system. One such setup utility was designed for Epos and will be described later in section 5.2.1.

Epos supports two allocation scenarios for abstractions: Late and Ahead. Both are modeled as members of the Allocated dissociated family of scenario aspects presented in figure 25. The former member defines a scenario in which the memory needed to hold the state of system abstractions is dynamically

allocated as the abstraction is instantiated, while the latter assumes it has been previously allocated. The Ahead aspect reduces the allocation of many abstractions to marking pre-allocated instances "busy". Epos knows how many instances of an abstraction have been pre-allocated by consulting its traits.





Figure 25: Epos family of allocation aspects.

A scenario in which the Late Allocated aspect is used exclusively is improbable, since some abstractions (e.g. devices) will always be allocated in advance. Therefore, the application of this scenario aspect must be evaluated individually for each abstraction (consulting its traits).
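A minimal sketch of the Ahead strategy, in which instantiation reduces to marking a pre-allocated instance busy. The pool size would come from the abstraction's traits; the names and the sample Device_Stub type are illustrative assumptions.

```cpp
#include <array>
#include <cstddef>

// Sketch of Ahead allocation: N instances are laid out in advance (e.g. by
// the setup utility) and allocation just searches for a free slot.
template<typename Abs, std::size_t N>
class Ahead_Pool {
    std::array<Abs, N> _pool{};
    std::array<bool, N> _busy{};
public:
    Abs * allocate() {
        for (std::size_t i = 0; i < N; ++i)
            if (!_busy[i]) { _busy[i] = true; return &_pool[i]; }
        return nullptr;                        // pool exhausted
    }
    void release(Abs * p) {
        _busy[static_cast<std::size_t>(p - _pool.data())] = false;
    }
};

struct Device_Stub { int id; };                // sample pre-allocated abstraction
```

Note that no memory is obtained at allocation time; the instances already exist, possibly with device-mapped state set up before the operating system started, which is exactly what a portable dynamic allocator could not do.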



Protection
In multitasking environments, various processes compete for memory, processor, network, and other system resources. In such environments, a protection mechanism to ensure that processes do not inappropriately interfere with one another's activities is desirable. In order to gain access to resources, processes would thus have to obtain explicit authorization from the operating system.

Protection in generic systems tends to extend towards security, which concerns preventing unauthorized users from interfering with ongoing computations. However, the mechanisms of authentication and encryption used in the scope of distributed and web-based computing to attain security are rarely required for dedicated computing because, as a matter of fact, dedicated systems are (temporarily) single-user systems. Therefore, Epos protection mechanisms are restricted to controlling access to resources. These mechanisms are modeled by the Protected incremental family of scenario aspects depicted in figure 26.

The Checked aspect of the Protected family profits from the Capability identifier described in section 4.1 as a protection mechanism. That identifier uses a sparse random number to achieve identification uniqueness, which can







also serve protection purposes: in order to access a system object, a process has to know its capability. The probability of gaining access to an object by guessing can be made insignificant if capabilities are sufficiently sparse, which is controlled by the rand length configurable feature of the Capability aspect. Lengths ranging from 32 to 64 bits should suffice for most systems.

Figure 26: Epos family of protection aspects.

The Permitted protection aspect extends Checked to build a protection scenario in which operations on abstractions are individually authorized or denied for each process [MT86]. In order to achieve this scenario, Permitted wraps the random number generator used by the Capability, adding permission flags to the rand field (the rand length configurable feature is adjusted to accommodate these extra bits). A new operation on objects of type Capability is supplied that produces restricted versions of a capability as a non-reversible function of the unrestricted version (the owner capability). A list with the owner capabilities corresponding to the objects created by each process is maintained by the Enrolled sharing aspect. Access control is thus enforced by the scenario adapter, which checks for permission before allowing an operation to proceed.

This family of scenario aspects is specialized for the Segment abstraction [section 3.1.2] to accomplish memory protection if the hardware so allows. In this way, segments are attached to the address space of a process respecting the permissions in the supplied capability. Memory protection is also useful in single-tasking environments: by properly protecting segments, programming mistakes such as "lost pointers" can be detected, thus helping to debug applications. When protection is violated, Epos produces a useful report of the circumstances. Besides, Epos always protects its own memory segments when the Protected scenario is enabled.
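The derivation of restricted capabilities could be sketched as follows. The mixing function and the flag layout here are stand-ins chosen for illustration, not EPOS's actual algorithm; what matters is that the function is one-way, so a restricted capability cannot be upgraded back into the owner capability.

```cpp
#include <cstdint>

// Sketch of Permitted: permission flags are embedded in the rand field and
// a one-way mix prevents recovering the owner capability from a restricted one.
using perm_t = std::uint8_t;                   // e.g. bit 0 = read, bit 1 = write

inline std::uint64_t one_way(std::uint64_t x) {  // stand-in 64-bit mixer
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

// Derive a restricted rand field from the owner's rand field and the
// permissions being granted; the flags stay visible in the low bits.
inline std::uint64_t restricted(std::uint64_t owner_rand, perm_t perms) {
    return (one_way(owner_rand ^ perms) & ~0xffULL) | perms;
}

inline bool permits(std::uint64_t rand_field, perm_t op) {
    return (rand_field & op) == op;
}
```

The scenario adapter would call permits() with the flag corresponding to the invoked operation before letting it proceed, as described above.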



Timing
A thread often waits for events such as I/O completion, thread termination, or message arrival. Some of these events, however, may suffer delays beyond acceptable limits; some may not occur at all. In some cases, it may be more adequate for an application to lose a message sent over a jammed network than to delay a computation waiting for it. Similarly, it may be better for a control application to proceed with a predefined action than to miss a deadline waiting for user intervention. Such situations can be managed by assigning a time limit to operations that may cause a thread to wait. If the operation does not conclude within the specified interval, it is terminated by the operating system with a failure status.

A mechanism to specify time-outs for operations was modeled in Epos as a global scenario aspect. However, associating time-outs with operations that are not eligible to suffer delays would add unnecessary overhead to the system. Therefore, the Limited scenario aspect responsible for time-outs is only applied to abstractions after consulting their traits [section 5.1]. This scenario aspect belongs to the Timed dissociated family represented in figure 27. Limited relies on the Alarm timer to implement time-outs.




Figure 27: Epos family of timing aspects.

A timing aspect that is less often deployed is realized by the Delayed member of the Timed family. It allows operations on system abstractions to be delayed by a certain time. It is useful, for instance, to migrate embedded applications with strict timing dependencies to a faster hardware platform, or to achieve a homogeneous execution time for the processes of a parallel application running on nodes of heterogeneous speed.
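In spirit, the Limited aspect wraps a potentially blocking operation with a deadline, as in this sketch. EPOS implements time-outs with the Alarm abstraction; std::chrono and a polling loop stand in for it here, and the names are illustrative.

```cpp
#include <chrono>
#include <functional>

// Sketch of the Limited timing aspect: poll an operation's completion
// predicate until it succeeds or the time limit expires.
enum class Status { DONE, TIMEOUT };

inline Status limited(std::chrono::milliseconds limit,
                      const std::function<bool()> & completed) {
    auto deadline = std::chrono::steady_clock::now() + limit;
    while (!completed())
        if (std::chrono::steady_clock::now() >= deadline)
            return Status::TIMEOUT;            // terminated with failure status
    return Status::DONE;
}
```

An application would check the returned status and, for instance, proceed with its predefined fallback action on TIMEOUT instead of waiting indefinitely.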




Atomicity
Events in a parallel program are no longer strictly ordered as in sequential programs; they can occur simultaneously. When multiple threads are allowed to execute in parallel, it is possible that they simultaneously "enter" the operating system, i.e. execute operating system code to accomplish a system service. An operating system that allows for this scenario is said to be reentrant and is itself subject to race conditions.

Reentrance is often approached in monolithic systems by properly coordinating the execution of critical sections identified in the course of system development. The dynamic architecture of an application-oriented operating system, however, makes this solution impractical. The combination of abstractions, aspects, and configurable features may have profound consequences on the system architecture, invalidating some of the hand-made critical sections. Therefore, Epos handles the synchronization pitfalls brought about by reentrance by ensuring that system operations are atomic. In this way, transformations of the state of system objects either occur completely or do not occur at all.

The atomicity property was modeled to be orthogonal to abstractions, yielding the Atomic uniform family of scenario aspects depicted in figure 28. The Uninterrupted aspect achieves atomicity by disabling the generation of hardware interrupts that could cause the processor to be preempted from a thread in the middle of a system operation. This family member has a low overhead, but it may also have undesirable side-effects on applications. Disabling interrupts may spoil I/O subsystems due to delays in interrupt handling and cause "scheduling skew". Furthermore, the Uninterrupted scenario aspect is not suitable for multiprocessor environments, since threads executing in parallel can simultaneously invoke system abstractions independently of interrupts. Therefore, three other members were modeled for the Atomic family that do not assume interrupts to be disabled.
All three deploy the Mutex member of the Synchronizer family of abstractions [section 3.3.1] to transform every system operation into a critical section13. They also share similar run-time
13 When the idle waiting configurable feature of CPU Scheduler is enabled in an Atomic scenario, the operation used to block a thread is modified so that blocked threads temporarily leave the critical section, thus enabling other threads to invoke system operations. This modification affects only the critical sections defined by the Atomic aspect; critical sections defined by the application are not affected.








overheads, corresponding to the invocation of Mutex methods to accomplish the mutually exclusive execution of system operations. Besides, they lock unallocated system resources under a single mutex, providing atomicity to constructors, destructors, and other class operations. These three members differ in the degree of parallelism they sustain. The System Locked member uses a single mutex for all abstractions, yielding a scenario in which a single thread is allowed to execute system operations at a time (null reentrance). Reentrance is actually supported by the Class Locked and Object Locked scenario aspects. The former assigns a different mutex to each class of system objects, so that operations on objects of different types can be invoked simultaneously. The latter assigns each system object its own mutex, achieving the highest level of reentrance14. Although the run-time overhead incurred by these scenario aspects is equivalent, their consumption of synchronization resources is extremely diverse: System Locked uses a single mutex, Class Locked uses a "global" mutex plus one mutex for each abstraction configured for the system, and Object Locked uses one mutex for each system object created plus the "global" one.

Figure 28: Epos family of atomicity aspects.
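The difference between the lock granularities can be sketched with templates. This is a simplified illustration of what a scenario adapter could do, using std::mutex in place of EPOS's Mutex; the actual adapters are part of the framework metaprogram described in section 5.1.

```cpp
#include <mutex>

// Class_Locked: one mutex per abstraction (a static member), so operations on
// objects of the same type serialize, while different types proceed in parallel.
template<typename Abs>
class Class_Locked : public Abs {
    static std::mutex _lock;
public:
    template<typename Op>
    void atomic(Op op) {
        std::lock_guard<std::mutex> guard(_lock);
        op(static_cast<Abs &>(*this));
    }
};
template<typename Abs> std::mutex Class_Locked<Abs>::_lock;

// Object_Locked: one mutex per system object, the highest level of reentrance.
template<typename Abs>
class Object_Locked : public Abs {
    std::mutex _lock;
public:
    template<typename Op>
    void atomic(Op op) {
        std::lock_guard<std::mutex> guard(_lock);
        op(static_cast<Abs &>(*this));
    }
};

struct Counter { int n = 0; void inc() { ++n; } };  // sample abstraction
```

System Locked would be the degenerate case of Class_Locked with a single shared mutex for every abstraction, which is why its resource consumption is minimal and its reentrance null.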


Remote Invocation

In a distributed environment, processes may need to access system resources residing on remote nodes. In order to do so, an application would have to create a process on each node containing useful resources and deploy a communicator to interact with them. However, the burden of explicit message passing for accessing remote resources can be eliminated by a Remote Procedure Call (RPC) [BN84] or, in an object-oriented context, by a Remote Object Invocation (ROI) [LT91] mechanism. These mechanisms hide inter-process communication behind ordinary method invocations, so that processes can transparently access remote resources.

Epos supports remote object invocation as a scenario aspect that can be transparently applied to virtually any abstraction. This aspect is realized by the Remote scenario aspect represented in figure 29. It relies on the Port communicator for message exchange, on the Global Id (or a derivative) scenario aspect for object location, and on the Shared scenario aspect to control global sharing.

14 System abstractions that are intrinsically atomic, such as Mutex, are properly escaped by the scenario adapter [section 5.1].


Figure 29: Epos remote invocation scenario aspect.

An object invocation in a Remote scenario is depicted in figure 30. The object being remotely manipulated is represented in the client process domain by a proxy, and mediated in its own domain by an agent. When an operation is invoked on the object's proxy, the arguments supplied are marshaled in a request message and sent to the object's agent. This message is addressed to a well-known ROI port on the node on which the object resides (which is designated by the host field of its identifier). The agent listens on this port using a private thread, thus preventing the blockage of ongoing computations. When the agent receives a request, it unpacks the arguments and performs a local method invocation. The whole process is then repeated in the opposite direction, producing a reply message that carries eventual return arguments back to the client process.

The number of threads effectively created by the system to listen on the ROI port is controlled by the ROI threads configurable feature, with zero meaning that threads are dynamically created for each invocation. The placement of the ROI port and ROI agents, however, varies according to Epos resultant architecture. In the embedded-in-the-application configuration, these entities belong to the single application process, and, in the μ-kernel configuration,







they belong to a separate ROI server process.

Figure 30: A remote object invocation in the Remote scenario.

Applying the Remote scenario aspect to abstractions would require a special scenario adapter, able to cope with proxies and agents. This could be a problem for ordinary scenario adapters that are hand-written, since a new adapter would have to be defined for every abstraction. This problem was eliminated by inserting placeholders in the Epos component framework to install ROI proxies and agents (figure 31), which can be automatically generated from the interfaces of abstractions by a tool (see section 6 for a description of how the Epos syntactical analyzer can be used to obtain the operation signatures of all abstractions). These proxies and agents are arranged in the proper places when Remote is enabled, and are otherwise replaced by metaprogrammed dummies, which are completely eliminated during compilation.

Figure 31: The Remote scenario aspect adapter.

A remote object invocation is firstly expressed in a program using the extra constructor defined by the Shared aspect for all system abstractions. A process knowing the id of a remote object can use such a constructor to obtain a remote share of the object. Subsequently, that process can invoke operations on the object disregarding locality. For example, a remote thread could be suspended as follows:

  Thread thread(remote_thread_id);
  thread.suspend();

and a thread could be created on a remote node in this way:

  Thread remote_thread(remote_task_id, entry_point, arguments...);

The actual owner of this thread would be the ROI agent on the node where it was created.
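The proxy side of this mechanism could be sketched as below. This is a hypothetical fragment: the message layout, method codes, and names are assumptions, and in EPOS the real proxies are generated from the abstractions' interfaces rather than written by hand.

```cpp
#include <cstdint>
#include <vector>

// Sketch of a request message sent to the well-known ROI port: it carries
// the target object's global id, an operation code, and marshalled arguments.
struct Request {
    std::uint16_t host, type, unit;            // global id of the target
    std::uint8_t  method;                      // operation being invoked
    std::vector<std::uint8_t> args;            // marshalled arguments
};

// Hypothetical Thread proxy: invoking suspend() produces a request addressed
// to the object's node instead of performing the operation locally.
class Thread_Proxy {
    std::uint16_t _host, _type, _unit;
public:
    enum Method : std::uint8_t { SUSPEND = 1, RESUME = 2 };
    Thread_Proxy(std::uint16_t h, std::uint16_t t, std::uint16_t u)
        : _host(h), _type(t), _unit(u) {}
    Request suspend() const {                  // would be handed to a Port
        return Request{_host, _type, _unit, SUSPEND, {}};
    }
};
```

The agent on the target node would perform the inverse operation: unpack the request, invoke the method on the local object, and marshal the reply.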


Debugging and Profiling

Being able to trace the invocation of system operations, or to watch the state of system abstractions, can be useful to debug application programs. Likewise, being able to summarize how much time an application spends with each system abstraction may be a source of optimization. Epos supports these features through the Debugged family of dissociated scenario aspects depicted in figure 32. The device used to report debugging and profiling information is selected through the output dev configurable feature.

The scenario aspects in this family are interpreted as follows: Watched causes the state of a system object to be dumped every time it is modified; Traced causes every system invocation to be signalized; and Profiled audits the time spent by a process with each system operation, producing a report when the process terminates. The amount of information produced for each abstraction is controlled by its traits.
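The Traced aspect amounts to wrapping each invocation with a report on the configured output device, as in this sketch (illustrative names; std::cerr stands in for the output dev device, and the sample Buffer type is an assumption):

```cpp
#include <iostream>

// Sketch of the Traced debugging aspect: every operation routed through the
// adapter is signalized on the output device before it executes.
template<typename Abs>
class Traced : public Abs {
public:
    template<typename Op>
    void invoke(const char * name, Op op) {
        std::cerr << "[trace] " << name << "\n";   // signalize the invocation
        op(static_cast<Abs &>(*this));
    }
};

struct Buffer { int writes = 0; void write() { ++writes; } };  // sample abstraction
```

Watched and Profiled would differ only in what the wrapper does: dumping the object's state after the operation, or accumulating per-operation timing for a final report.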







Figure 32: Epos family of debugging and profiling aspects.



Summary of EPOS Scenario Aspects

A summary of Epos scenario aspects is presented in table 3. A brief description of responsibilities and dependencies is given for each family of scenario aspects and for each member of a family.


Scenario Aspect      Responsibilities                              Dependencies

Identification ⊕     system object identification                  -
  Pointer            intra-process id                              -
  Local              local node id                                 -
  Global             SAN-wide id                                   -
  Capability         sparse SAN-wide id                            -
Sharing ⊕            sharing of system objects among processes     -
  Referenced         reference counter                             -
  Enrolled           clients list                                  -
Allocation           memory allocation for system objects          -
  Late               dynamic allocation                            -
  Ahead              pre-allocation                                Setup
Protection ⊕         system object access control                  -
  Checked            id knowledge                                  Capability
  Permitted          permitted operations                          Enrolled
Timing               system operations timing                      Alarm
  Limited            time-out                                      -
  Delayed            delay                                         -
Atomicity            system operations reentrance                  -
  Uninterrupted      interrupt disabling                           -
  System Locked      global monitor                                Mutex
  Class Locked       abstraction monitor                           Mutex
  Object Locked      system object monitor                         Mutex
Remote               remote object invocation                      Port, Global Id, Shared
Debugging            system debugging and profiling                Device
  Watched            system object watching                        -
  Traced             system operation tracing                      -
  Profiled           system operation profiling                    -

Aspect types: ⊕ incremental.

Table 3: Epos scenario aspects.



System Architectures

As an application-oriented operating system, Epos has a highly scalable architecture that is modeled to meet the needs of the applications it supports. Ultimately, the Epos architecture results from the organization of the components selected for a given system configuration. Distinct combinations of system abstractions and scenario aspects lead to different software architectures, some of which are delivered as a kernel, while others are completely embedded in the application. Therefore, the component framework represents the core of the Epos software architecture, for it dictates how abstractions can be combined considering the peculiarities of the target execution scenario.

Notwithstanding the significance of a component framework to an application-oriented system design, a thorough handling of portability and initialization issues is fundamental to accomplish a scalable system architecture. If hardware architectural aspects are absorbed by the operating system's software architecture, porting it to another platform could obliterate much of the flexibility offered by the component framework. Besides, the bare hardware on which an operating system performs is seldom prepared to deal with the high-level language constructs of a component framework. The hardware platform has to be appropriately initialized to house a composite of application-oriented system abstractions.


Component Framework

An application-oriented component framework captures elements of reusable system architectures while defining how abstractions can be arranged together in a functioning system. In this context, the Epos component framework was modeled as a collection of interrelated scenario adapters that build a "socket board" for system abstractions and scenario aspects. These are "plugged" into the framework via inflated interface binding.

The Epos component framework is realized by a static metaprogram and a set of composition rules. The metaprogram is responsible for adapting system abstractions to the selected execution scenario and arranging them together during the compilation of an application-oriented version of Epos. Rules coordinate the operation of the metaprogram, specifying constraints and dependencies for the composition of system abstractions. Composition rules

are not encoded in the metaprogram, but specified externally. They are interpreted by composition tools in order to adjust the parameters of the metaprogram. The separation of composition rules from the framework metaprogram allows a single framework to yield a variety of software architectures. Indeed, one could say that Epos has many frameworks, each corresponding to the execution of the metaprogram with a different set of arguments. Moreover, the use of static metaprogramming to compose system abstractions does not incur run-time overhead, thus yielding composites whose performance is directly derived from their parts.

5.1.1 Framework Metaprogram

Epos component framework metaprogram is executed in the course of a system instance compilation, adapting the selected abstractions to coexist with each other and with applications in the designated execution scenario. During this process, scenario-independent abstractions have their original properties preserved, so that internal compositions can be carried out before scenario adaptation. This is accomplished by having the framework metaprogram import scenario-independent abstractions in one namespace and export the corresponding scenario-adapted versions in another. For example, the cascaded aggregation of Communicator, Channel, and Network [section 3.4] takes place at the scenario-independent level. The resultant composite is later adapted to the selected scenario as a whole. Similarly, the Atomic scenario aspect instantiates a scenario-independent Mutex [section 4.6], i.e. one without the adaptations performed by the framework metaprogram, which might have transformed it, for instance, into a remotely accessible, shared, and protected abstraction to match the scenario required by the application.

Figure 33 shows a top-view diagram of the component framework static metaprogram. Each of the elements represented in the figure will be subsequently described.

Figure 33: A top-view of Epos component framework metaprogram.

A class diagram representing the Handle framework element is depicted in figure 34. Like most other elements, parameterized class Handle takes a system abstraction (a class of system objects) as parameter. When instantiated, it acts as a "handle" for the supplied abstraction, realizing its interface so that invocations of its methods are forwarded to Stub. Hence, system objects are manipulated by applications via their "handles". Handle provides additional operations to check whether a system object was successfully created [15] and to obtain its id. Besides, when the Shared scenario aspect [section 4.2] is enabled, Handle provides the extra constructors used to designate sharing in that scenario. The aggregation relationship between Handle and Stub enables the system to enforce allocation via a system allocator (instead of a programming language one), thus allowing for the Allocated scenario aspect [16].

[15] The use of C++ exceptions as a mechanism to signal system object creation failure was avoided because it would make the integration of Epos with applications written in other programming languages difficult.
[16] The allocator used with each abstraction is selected by Adapter after consulting Traits<Abs>::allocated.

Figure 34: Epos framework: the Handle element.

The Stub framework element is depicted in figure 35. This parameterized class is responsible for bridging Handle either with the abstraction's scenario adapter or with its proxy. It declares two formal parameters: an abstraction and a boolean flag that designates whether the abstraction is local or remote to the address space of the calling process. By default, Stub inherits the abstraction's scenario adapter, but it has a specialization, namely Stub<Abs, true>, that inherits the abstraction's proxy. Therefore, making Traits<Abstraction>::remote = false causes Handle to take the scenario adapter as the Stub, while making it true causes Handle to take the proxy.
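The Handle-Stub-Adapter-Scenario chain described above can be sketched with standard C++ class templates. The sketch below is illustrative rather than Epos source: the Sync abstraction, its single lock operation, and the Log counters are made-up stand-ins, and Traits is reduced to the one remote flag; only the forwarding and wrapping structure mirrors the text.

```cpp
#include <cassert>

// Counters let us observe the adapter's enter/leave wrapping (illustrative).
struct Log { static int enters, leaves; };
int Log::enters = 0;
int Log::leaves = 0;

// Traits drives the metaprogram; here reduced to the "remote" flag.
template<class Abs> struct Traits { static const bool remote = false; };

// A stand-in system abstraction (class of system objects).
class Sync { public: int lock() { return 42; } };

// Scenario delivers the enter/leave primitives for the selected scenario.
template<class Abs>
class Scenario {
public:
    void enter() { ++Log::enters; }   // e.g. acquire locks, check capabilities
    void leave() { ++Log::leaves; }
};

// Adapter wraps each operation of the abstraction in the scenario primitives.
template<class Abs>
class Adapter: public Abs, private Scenario<Abs> {
public:
    int lock() {
        this->enter();
        int res = Abs::lock();
        this->leave();
        return res;
    }
};

// Stub selects the scenario adapter by default; a Stub<Abs, true>
// specialization would inherit the abstraction's Proxy instead (omitted).
template<class Abs, bool remote>
class Stub: public Adapter<Abs> {};

// Handle realizes the abstraction's interface, forwarding to the Stub
// selected at compile time from Traits<Abs>::remote.
template<class Abs>
class Handle {
public:
    int lock() { return _stub.lock(); }
private:
    Stub<Abs, Traits<Abs>::remote> _stub;
};
```

An application would then declare `Handle<Sync> s;` and call `s.lock()`, which transparently passes through the scenario adapter.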

Figure 35: Epos framework: the Stub element.

The Proxy framework element is deployed when the remote scenario described in section 4.7 is established. Proxy realizes the interface of the abstraction it represents, forwarding method invocations to its Agent (see figure 36). Each instance of Proxy has a private ROI message, which is initialized in such a way that forthcoming invocations only need to push parameters into the message. Moreover, because Proxy is metaprogrammed, parameters are pushed directly into the message, without being pushed onto the stack first. Proxy operations invoke the method invoke [17] to perform a message exchange with Agent, which, like Handle in a local scenario, forwards invocations to the abstraction's Adapter.

[17] The semantics of the invoke method varies according to the selected configuration. In some cases, it causes the application process to "trap" into the kernel; in others, it directly accesses a communicator to perform a message exchange.
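The put/invoke/get discipline of the Proxy-Agent pair can be sketched as follows. The Message, Adder, and in-process invoke used here are hypothetical stand-ins for the real ROI message exchange; only the marshalling pattern mirrors the text.

```cpp
#include <cassert>
#include <vector>

// A toy message: parameters and results travel as a queue of ints.
struct Message {
    std::vector<int> data;
    void put(int v) { data.push_back(v); }
    int get() { int v = data.front(); data.erase(data.begin()); return v; }
};

// A stand-in scenario-adapted abstraction on the server side.
class Adder { public: int add(int a, int b) { return a + b; } };

// Agent: unmarshals parameters, invokes the adapter, marshals the reply.
class Agent {
public:
    void perform(Message & msg) {
        int a = msg.get(), b = msg.get();
        int res = _adapter.add(a, b);
        msg.put(res);
    }
private:
    Adder _adapter;
};

// Proxy: realizes the abstraction's interface, pushing parameters directly
// into its private message and invoking the agent.
class Proxy {
public:
    int add(int a, int b) {
        _msg.put(a);
        _msg.put(b);
        invoke();              // stands in for the real message exchange
        return _msg.get();
    }
private:
    void invoke() { _agent.perform(_msg); }
    Message _msg;
    Agent _agent;
};
```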
Figure 36: Epos framework: Proxy and Agent elements.

The Adapter framework element is depicted in figure 37. This parameterized class realizes a scenario adapter for the abstraction it takes as parameter, adapting its instances to perform in the selected scenario. Adaptations are carried out by wrapping the operations defined by the abstraction within the enter and leave scenario primitives, and also by enforcing scenario-specific semantics for creating, sharing, and destroying its instances. The role of Adapter in Epos framework is to apply the primitives supplied by Scenario to abstractions, without making assumptions about the scenario aspects represented in these primitives. In this way, Adapter is able to enforce any combination of scenario aspects.

Figure 37: Epos framework: the Adapter element.

The execution scenario for Epos abstractions is ultimately shaped by the Scenario framework element depicted in figure 38. Each instance of parameterized class Scenario delivers scenario primitives that are specific to the abstraction supplied as parameter. Firstly, it incorporates the selected Id aspect, which is common to all abstractions in a scenario; then it consults the abstraction's Traits to determine which aspects apply to it, aggregating the corresponding scenario aspects. The strategy to cancel an aggregation is similar to the one used with Stub, i.e. a parameterized class that inherits the selected aspect by default, but is specialized to inherit nothing in case the aspect is not selected for the abstraction. Besides designating which scenario aspects apply to each abstraction, the parameterized class Traits maintains a comprehensive compile-time description of abstractions that is used by the metaprogram whenever an abstraction-specific element has to be configured.

5.1.2 Composition Rules

Epos component framework metaprogram is able to adapt and assemble the selected components to produce an application-oriented operating system. However, though the metaprogram knows about the particular characteristics of each system abstraction from its traits, it does not know of relationships between abstractions and hence cannot guarantee the consistency of the composites it produces. In order to generate a meaningful instance of Epos, the metaprogram must be invoked with a coherent parameter configuration. Therefore, the operation of the framework metaprogram is coordinated by a set of composition rules that express elements of reusable system architecture captured during design.

Figure 38: Epos framework: the Scenario element.

A consistent instance of Epos comprises system abstractions, scenario aspects, hardware mediators, configurable features, and non-functional requirements. Composition rules specify dependencies and constraints on such elements, so that invalid configurations can be detected and rejected. Nevertheless, guaranteeing that a composite of Epos elements is "correct" would depend on the formal specification and validation of each element, which is outside the scope of this research. Sometimes, composition rules are implicitly expressed in the implementation of components. For example, by referring to the Datagram channel, the Port communicator implicitly specifies a dependency rule that requires Datagram to be included in the configuration whenever Port is deployed. However, most composition rules, especially those designating constraints on combining abstractions, can only be expressed externally. For instance, the rule that expresses the inability of the Flat address space to support the Mutual task abstraction must be explicitly written. In order to support the external specification of composition rules, Epos

elements are tagged with a configuration key. When a key is asserted, the corresponding element is included in the configuration. Elements that are organized in families are selected by assigning a member's key to the family's key, causing the family's inflated interface to be bound to the designated realization. This mechanism implements the selective realize relationships modeled during design. For example, writing Synchronizer := Semaphore causes the inflated interface of the Synchronizer family of abstractions to be bound to member Semaphore, and writing Id := Capability binds the Id scenario aspect to Capability. Elements that do not belong to families have their keys asserted directly. For example, writing Busy Waiting := True enables the Busy Waiting configurable feature of the CPU Scheduler abstraction. Composition rules are thus defined by associating pre- and postconditions to configuration keys. For instance, the following rule for the Task family of abstractions requires the Paged address space to be selected before the Mutual task can be selected:

    Mutual → pre: Address Space = Paged

Alternatively, this constraint could be expressed as a composition rule for the Address Space family that selects the Exclusive task whenever the Flat address space is selected:

    Flat → pos: Task := Exclusive

Composition rules are intended to be automatically processed by configuration tools. Hence, it is fundamental to keep them free of cycles. The following pair of rules, though understandable to a human, could bring a tool to a deadlock:

    A1 → pre: B = B1
    B1 → pre: A = A1

In order to ensure that Epos composition rules build a directed acyclic graph, the following directives were observed:

- Configuration keys are totally ordered according to an arbitrary criterion;

- Preconditions are restricted to expressions involving only preceding keys;

- Postconditions are restricted to assignments involving only succeeding keys.

Along with the traits of each abstraction, composition rules control the process of tailoring Epos to a particular application. The set of configuration keys selected by the user is validated and refined by means of composition rules, yielding a set of elements that are subsequently assembled by the framework metaprogram, which consults the traits of the abstractions.
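The directives above can be sketched as a rule-refinement pass such as a configuration tool might perform. The key and rule names are made up for the example (they are not the real Epos catalog); the point is that evaluating keys in their total order, with postconditions assigning only succeeding keys, necessarily terminates.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// A rule in the restricted form used here: if `key` is bound to `value`,
// assign `post_value` to the (strictly succeeding) key `post_key`.
struct Rule {
    std::string key, value;
    std::string post_key, post_value;
};

// Evaluate keys in their total order, firing matching postconditions.
// Because postconditions only touch succeeding keys, one pass suffices.
std::map<std::string, std::string>
refine(std::map<std::string, std::string> config,
       const std::vector<std::string> & order,
       const std::vector<Rule> & rules) {
    for (const auto & k : order)
        for (const auto & r : rules)
            if (r.key == k && config[k] == r.value)
                config[r.post_key] = r.post_value;
    return config;
}
```

For instance, the rule "Flat → pos: Task := Exclusive" from the text becomes `{"Address_Space", "Flat", "Task", "Exclusive"}`, and refining a configuration that binds Address_Space to Flat yields Task bound to Exclusive.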



The steady evolution of computing systems makes it easy to predict that many dedicated applications will experience more than a single hardware platform over their life cycles. From an application-oriented perspective, applications should transparently endure such hardware migrations, delegating portability issues to the run-time support system and the compilation environment. Therefore, Epos visible elements were designed in such a way as to conserve syntax and semantics when ported to new hardware platforms. Elements may be unavailable on a platform because it does not feature the necessary hardware support, but those available behave identically on all platforms. The extremely flexible architecture of Epos requires portability issues to be dealt with consistently. If abstractions, scenario aspects, or component framework elements incorporate hardware idiosyncrasies, porting them to a new hardware platform could impair Epos and consequently the applications running on it. Nevertheless, as an operating system that aims at delivering a high-performance foundation to dedicated applications, Epos has to ponder the implications that portability may have on performance [Lie96]. Epos pursues the balance between portability and performance by means of two artifacts: a setup utility and a set of hardware mediators. The setup utility runs prior to the operating system to prepare the hardware platform to host a mostly portable system. As the utility builds an elementary execution context for Epos, it initializes numerous hardware components, setting the system free from a main source of non-portability. The setup

utility is itself highly dependent on the hardware platform for which it is implemented and does not aim at being portable. Non-portable hardware interactions after the initialization phase are avoided in Epos whenever possible. Sometimes, non-portable hardware mechanisms are replaced by portable ones in software, as long as resources are not compromised. For instance, the context switch operation available in some processors can usually be replaced by a software routine without impairment. Nevertheless, some configurations of Epos cannot escape non-portable interactions with the hardware. The Paged address space abstraction, for instance, requires Epos to interact with the MMU to create, destroy, and modify the address spaces of processes. Architectural dependencies that cannot be handled by the setup utility are encapsulated in hardware mediators. When a system abstraction or a scenario aspect needs to interact with the hardware, it does so via a mediator, thus promoting portability. Hardware mediators, like the setup utility, are not portable; they are specifically designed and implemented for each platform.



The Setup Utility

Epos setup utility is a non-portable tool that executes prior to the operating system to build an elementary execution context for it. The resources used by this utility are completely released before the first application process is created, so it can relax resource rationalization in favor of an extensive setup procedure that carefully validates each configuration step. The setup utility receives a Sys_Info structure from the bootstrap that describes the relevant characteristics of the forthcoming Epos configuration, so it knows which devices have to be activated and which elementary memory model has to be implemented. As the utility proceeds with hardware setup, it updates and completes Sys_Info, including information about the physical resources configured, a memory map describing how the operating system has been loaded, the node's logical id, etc. This structure is later delivered to the operating system to assist in the initialization of portable system components. When the Ahead Allocated scenario aspect is enabled, the setup utility pre-allocates the indicated abstractions in the data segment of the operating system, relocating the respective pointers. For instance, if a Fast Ethernet network adapter is marked to be allocated in advance, setup initializes the device, maps it to a Network system object, and attaches it to the operating system's list of resources. Differently from what one could imagine, the setup utility is seldom a large and complex software artifact. On the low end of embedded systems, setup may be reduced to loading the operating system, since many of the microcontrollers used in such systems dispense with further setup procedures. In contrast, the setup utility for the high-end workstations of a cluster can usually rely on a built-in monitor to configure the platform. Besides promoting portability, the setup utility considerably reduces the complexity of other Epos elements, for they no longer need to cope with hardware initialization.
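A Sys_Info structure along the lines described above might look like the sketch below. The field names and types are assumptions derived from the prose (boot image contents, memory map, node id, configured resources); the real layout is platform- and configuration-dependent.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the structure handed from bootstrap to setup to init.
struct Sys_Info {
    // filled by the bootstrap: what the boot image contains
    std::uint32_t boot_image_size;
    // filled and completed by setup: how the platform was configured
    std::uint32_t memory_size;         // physical memory found and mapped
    std::uint32_t sys_code, sys_data;  // where the operating system was loaded
    std::uint16_t node_id;             // the node's logical id
    std::uint16_t n_cpus;              // processors detected
};
```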
A void configuration of Epos, i.e. a resulting system size of zero bytes, becomes possible in this scenario: some applications only need to be loaded on a pre-configured platform, dispensing with further operating system assistance. Such extreme cases are at the summit of application-orientation, demanding the highest degree of scalability from the operating system. They are only manageable in Epos due to the division of operating system responsibilities with the setup utility.


Hardware Mediators

A hardware mediator abstracts elements of the hardware platform that are used by system abstractions and scenario aspects. However, it is not the intention of these mediators to build a "universal virtual machine" for Epos, but to hide the peculiarities of hardware components that are frequently used by the operating system, such as the memory management unit and the interrupt controller. Mediators realize an operating system interface for these components, thus preventing architectural dependencies from spreading over the system. Mediators are themselves hardware dependent, being sometimes coded in assembly, with macros, or with static metaprogramming techniques. The Node hardware mediator depicted in figure 39 yields a topmost abstraction for the computer on which Epos executes, being its unique global object. The Node comprises a set of other hardware mediators, including one or more processors (CPU), an interrupt controller (IC), a timer (TMR), and a real-time clock (RTC). These hardware mediators are only included in a system configuration if the equivalent hardware components are available on the platform and if they were required by some system abstraction or scenario aspect selected by the user. Additionally, Node is associated with the CPU Scheduler.





Figure 39: Epos hardware mediator Node.

A special realization of the Node hardware mediator is deployed for symmetric multiprocessor nodes [18] that is able to cope with the scheduling and
[18] Asymmetric multiprocessing is handled in Epos with the ordinary Node mediator representing the "main" processor and secondary processors being represented as devices.


synchronization singularities of a parallel environment. This version of Node relies on the Atomic scenario aspect [section 4.6] to coordinate parallel system invocations. Whether processor affinity is considered for processor scheduling hinges on the corresponding configurable feature of CPU Scheduler [section 3.2.3]. The CPU mediator depicted in figure 40 aggregates three other mediators: the floating point unit (FPU), the memory management unit (MMU), and the time-stamp counter (TSC). As explained earlier, these mediators are not aimed at simulating the hardware components they represent, but at implementing an operating system interface for them. For instance, the MMU mediator is invoked to map physical memory into a paged address space, allocating and plugging the necessary frames into the designated page table.






Figure 40: Epos hardware mediator CPU.

Besides acting as a hardware mediator, CPU implements operations to save and restore the context of processes, and to perform the bus-locked read-and-write transactions (test-and-set-lock) required by the Synchronizer family of abstractions. It also holds a reference to the thread currently running on the processor. Together with the setup utility, hardware mediators enable most of the abstractions and scenario aspects described earlier in this chapter to be transparently ported to new hardware platforms. Only some abstractions that directly concern hardware elements, such as Bus and some members of the Device family, are subject to portability pitfalls.
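The test-and-set-lock operation of the CPU mediator, and a spin lock built on it as the Synchronizer family might, can be sketched as follows. A real mediator would emit the platform's bus-locked instruction (e.g. xchg on ia32); here the GCC/Clang atomic builtins stand in so the sketch stays compilable on any host, and the class is a simplification of the mediator described above.

```cpp
#include <cassert>

// Simplified sketch of the CPU hardware mediator (test-and-set-lock only).
class CPU {
public:
    // Atomically set the lock byte, returning its previous value.
    static bool tsl(volatile char & lock) {
        return __atomic_test_and_set((void *)&lock, __ATOMIC_ACQUIRE);
    }
};

// A spin lock layered on the mediator, as a Synchronizer member might be.
class Spin {
public:
    void acquire() { while (CPU::tsl(_lock)) ; }   // busy wait until clear
    void release() { __atomic_store_n(&_lock, (char)0, __ATOMIC_RELEASE); }
private:
    volatile char _lock = 0;
};
```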




Epos splits the initialization of the computing system it manages into two phases. The first, which was described in the previous section, concerns the initialization of the hardware platform, while the second concerns the initialization of system data structures and the creation of the first (and possibly unique) application process. Both procedures have been designed aiming at supporting the flexible system architecture delivered by Epos component framework. The procedure of system initialization uses many algorithms and temporary data structures that are never used again during ordinary operation. The architecture of Epos enables the resources allocated during this procedure, in particular memory, to be returned to the pool of free resources, so they can be reused later by applications. The first stage of Epos initialization, the setup utility, completely erases itself when done with its tasks, releasing all resources it had allocated. The second stage, the init utility, is activated as the last activity of setup, receiving the updated Sys_Info structure created by the bootstrap.

5.3.1 The Init Utility

Epos init utility is not a process, nor is it part of the operating system. It is just a routine that has plain access to the address space of the operating system, thus being able to invoke system operations. The initialization procedure carried out by the init utility consists of checking the traits of each abstraction to determine whether it has been included in the current system configuration [19], invoking the init class method of the abstractions that are present. Abstractions and hardware mediators are required to define this class method, which is responsible for initializing the associated operating system structures. Besides having access to the traits of the abstraction it initializes, the class method init also receives the Sys_Info structure as parameter, so it can consult the system description left by the setup utility. Furthermore, these class methods undergo a special link-editing process that causes them to be linked to the init utility rather than to the operating system.
[19] Traits are compile-time structures, so consulting the traits of an abstraction that has not been selected for a system configuration does not alter the configuration.


After having invoked the init class method of all present abstractions, the init utility invokes Epos operations, which by now are fully operational, to create the first process. If the dedicated application running on Epos is executed by a single process (per node), then the process created by the init utility is the application's unique process. Otherwise, this process is a loader that subsequently creates application processes in a multitasking environment. Before finishing, init releases all resources it had allocated, leaving Epos alone with the application processes.
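The traits-driven traversal performed by init can be sketched as below. The Thread and Network stand-ins, their trait flags, and the inited markers are illustrative; the real utility also passes the Sys_Info structure to each init class method.

```cpp
#include <cassert>

// Traits marks which abstractions are present in this configuration.
template<class Abs> struct Traits { static const bool enabled = false; };

// Stand-in abstractions exposing the init class method the text describes;
// the inited flags let us observe which ones were initialized.
struct Thread  { static int inited; static void init() { inited = 1; } };
struct Network { static int inited; static void init() { inited = 1; } };
int Thread::inited = 0;
int Network::inited = 0;

// In this configuration only Thread is selected.
template<> struct Traits<Thread> { static const bool enabled = true; };

// The init utility's per-abstraction step: present abstractions get init'ed.
template<class Abs>
void init_if_present() {
    if (Traits<Abs>::enabled)   // compile-time constant: dead branch removed
        Abs::init();
}
```

Since the trait is a compile-time constant, the disabled branch is eliminated by the compiler, so an abstraction that was not selected contributes no code to the init utility.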
Figure 41: An overview of Epos initialization.

An overview of Epos initialization is depicted in figure 41. After loading the boot image, which includes a preliminary system description (Sys_Info), the bootstrap invokes the setup utility to configure the hardware platform. Considering the specifications in Sys_Info, the setup utility builds an elementary memory model, configures the required devices, loads Epos, pre-allocates Ahead Allocated abstractions, loads the init utility, and activates it. The init utility, in turn, invokes the init class method of every abstraction included in the system to initialize its logical structure. It finishes by loading the executable provided in the boot image to create the first process.



System Organization

The initialization procedure described above causes Epos to assume one of the organizations depicted in figure 42. If the embedded-in-the-application organization is to be accomplished, the setup utility arranges for a single address space that is shared by the operating system and the application process. Hence, this organization implies the single-tasking environment realized by the Exclusive member of the Task family.





Figure 42: Epos organizations: (a) embedded-in-the-application and (b) µ-kernel.

Though sharing the same address space, application and operating system are not linked together when the embedded-in-the-application organization is selected. Instead, the application is supplied a copy of the operating system's symbol table during linkage, thus enabling it to invoke system operations via ordinary procedure calls [20]. This design decision enables the setup utility to adopt an operating system loading procedure that is common to both organizations, embedded-in-the-application and µ-kernel. Furthermore, avoiding linking operating system and applications together enables the portion of the address space in which the operating system resides to be protected against misbehaving applications. This is useful, even in single-tasking environments, to assist in debugging applications. If the hardware platform permits, Epos code segment is marked execute-only, while its data segment is marked accessible only from the code segment. Access violations regarding this scheme are caught and can be monitored by activating the Debugged scenario aspect.

[20] Epos component framework metaprogram is processed in the context of applications, therefore a fraction of the operating system will always be embedded in the application. Moreover, some elementary system operations may be completely embedded in the application (through inlining).

If Epos is configured as a µ-kernel, the setup utility prepares two address spaces: one for the kernel and one for the first process. In this case, the init utility is temporarily loaded into the address space of the kernel. With the µ-kernel organization, Epos features a system-call interface that is used by applications to interact with system abstractions. This interface implies the Remote scenario aspect being enabled. The first process in this organization is the application loader, which additionally starts the ROI server if the Global Id scenario aspect is enabled. Shared devices are delivered through device servers, which are also created by the loader. Since the loader is an ordinary user-level process, no modifications in the initialization procedure are necessary to accomplish this model.



Automatic Configuration

Epos application-oriented system design enables it to be tailored to match the requirements of particular applications. By specifying configuration keys, users select the system abstractions, scenario adapters, and hardware mediators that are necessary to support a given application. Composition rules facilitate the task of tailoring Epos, allowing users to specify a reduced number of configuration keys that are expanded into a coherent system configuration. Moreover, a user-friendly visual tool can be deployed to carry out this manual configuration process. However, Epos comprises hundreds of configurable elements. Delivering application programmers such a massive repository of components and a mechanism to select and plug them into a component framework may be inadequate. Users can miss the most appropriate system configuration for a given application simply because there are excessive alternatives. Even if composition rules allow them to specify a reduced number of configurable elements, it is possible that significant elements will be relegated to inadequate default configurations. Fortunately, application-oriented system design makes it possible to automate the system configuration process so as to reduce the occurrence of such incidents. For a component-based operating system like Epos, accomplishing an automated "system factory" is mainly a matter of understanding application requirements. The way Epos associates configuration keys to abstraction interfaces allows application programmers to specify the application's requirements on the operating system simply by writing down system invocations. If programmers are not sure about which member of a family of abstractions should be used in a situation, or if they realize the selection may need to be changed in the future, they can use the family's inflated interface instead. Afterwards, the application can be submitted to a pipeline of tools that will tailor Epos to fit it.
A schematic representation of the process of automatically configuring Epos is presented in figure 43. The first step consists of performing a syntactical analysis of the application's source code to identify which system abstractions have been invoked by the application and how they have been invoked. This step is carried out by the analyzer, which generates a preliminary configuration blueprint consisting of family (inflated) and member interfaces.


Figure 43: Epos automatic configuration tools.

The preliminary configuration blueprint produced by the analyzer is subsequently passed to the configurator. This tool consults the catalog of composition rules and the list of abstractions to refine the configuration blueprint. During this refinement, some inflated interfaces are bound in order to satisfy dependencies and constraints. Those that remain unbound are subsequently bound to their "lightest" realizations. The output of this tool is a table of configuration keys that shapes a framework for the desired Epos configuration.

In order to support the proposed automatic configuration scheme, Epos abstractions and scenario aspects have been sorted into a cost model according to their intrinsic overhead. This ordering is revealed by the sequence in which abstractions and scenario aspects have been introduced in their families along this chapter, i.e. the Exclusive member of the Thread family has a lower estimated cost than member Cooperative, and the Local member of the Id family has a lower estimated cost than member Global.

Figure 44: The dining philosophers problem in Epos.

The last step in the automatic configuration process is performed by the generator. This tool uses the set of configuration keys produced by the configurator to compile an application-oriented version of Epos. Such Epos instances include only those elements that are necessary to support the execution of the associated applications. Moreover, abstractions designated via inflated interfaces are selected so as to minimize the operating system overhead on resource management. An implementation of these configuration tools as a compiler front-end will be discussed in section 7.3.3. This implementation allows Epos to be transparently configured as the application is compiled. Nevertheless, Epos "operating system factory" cannot guarantee that the tailored system is optimal, even if the application completely delegates configuration to the presented tools by interacting with the system exclusively via inflated interfaces. The inaccuracy of the automatic configuration process arises from the cost criterion used to order abstractions within each family: overhead. The overhead criterion properly represents the cost of most abstractions but, in some cases, leads to non-optimal configurations. The processor scheduling policy, for instance, can be wrongly estimated under this criterion.
Sometimes the higher inherent overhead of a scheduling policy is compensated for by producing a sequence of scheduling events that matches the application's execution flow. Therefore, the set of automatic configuration tools was designed with interruption points after each phase, so users can invoke the manual configuration tool to check and modify the automatically generated configuration before it is submitted to the generator.
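The "lightest realization" step of the configurator can be sketched as a first-fit search over a cost-ordered family: bind an inflated interface to the cheapest member that supplies every operation the application invoked. The family catalog below is a toy stand-in, using the Synchronizer members discussed in this chapter.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A family member and the operation signatures it realizes.
struct Member {
    std::string name;
    std::vector<std::string> operations;
};

// Bind an inflated interface to the first (i.e. cheapest) member that
// supplies all required operations; an empty string means no member fits.
std::string bind_lightest(const std::vector<Member> & family_in_cost_order,
                          const std::vector<std::string> & required) {
    for (const auto & m : family_in_cost_order) {
        bool ok = true;
        for (const auto & op : required) {
            bool found = false;
            for (const auto & have : m.operations)
                if (have == op) found = true;
            if (!found) ok = false;
        }
        if (ok) return m.name;    // first fit in cost order wins
    }
    return "";
}
```

With the Synchronizer family listed in cost order as {Mutex, Semaphore}, a program that calls lock and unlock binds to Mutex, while one that calls p and v binds to Semaphore, matching the walkthrough that follows.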

For an example of automatic system configuration, the program in figure 44 will be considered. This is a valid Epos implementation of the classic dining philosophers problem [Dij71]. It requires three system abstractions: Thread, Synchronizer, and Console. The analyzer would identify these families, as well as the scope in which they were used, and output a table with the signatures of the operations used. Signatures are known to the analyzer from the interfaces supplied in the included system header files. The syntactical analysis of this program, ignoring the complex iostream header, would look like the following:

    global {
        System::Interface::Synchronizer::Synchronizer(void);
        System::Interface::Synchronizer::lock(void);
        System::Interface::Synchronizer::unlock(void);
    }
    main {
        System::Interface::Thread::Thread(int (*)(int), int);
    }

The analysis report reveals that objects of type Synchronizer have been created in the global scope using the default constructor, and that operations lock and unlock have been invoked on them. It also reveals that objects of type Thread have been created in the scope of function main using a constructor that takes a pointer to a function and an integer as parameters. The configurator would process this information considering the catalog of composition rules and the list of system abstractions. This would lead the Synchronizer inflated interface to be bound to Mutex, for it supplies the required operations. Even if Semaphore had declared a default constructor and operations lock and unlock as synonyms for p and v, thus emulating a Mutex, the cost ordering would have made Mutex the best choice. The selection of a realization for the Thread inflated interface would be based on a special composition rule that requires method pass, which hands processor control over to another thread, to be explicitly invoked in order that realization Cooperative is selected.
Since the Exclusive realization of Thread does not feature thread creation, the Concurrent member would be selected.

However, the Concurrent thread implies the CPU Scheduler abstraction, which in this case would adopt FCFS as the scheduling policy, for it incurs the lowest overhead. Nonetheless, this decision would only be adequate if the application is executed on a node with five or more processors. Otherwise, some philosophers would never execute. This reveals the limitations of Epos automatic configuration. Though the number of processors on each node is available to configuration tools, the decision to bypass the cost model and select a different scheduling policy cannot be taken automatically. If fewer than five processors are available in the indicated node, the user would either have to manually correct the configuration, or modify the ordering of scheduling policies in the cost model to define another default.



7 EPOS Implementation for the SNOW Cluster



The Epos system described in previous sections illustrates the deployment of application-oriented system design to engineer the domain of high-performance dedicated computing. That case study helped to corroborate the concepts and techniques of application-oriented system design proposed in [Frö01]. However, if Epos is to be accepted as a validation of application-oriented system design, its design must also be validated. Although many of Epos design decisions have been well substantiated, a design can hardly be positively evaluated before it is implemented to a significant extent. Therefore, a prototype implementation of Epos for the Snow cluster was carried out in the scope of this dissertation.


7.1 Why a Cluster of Workstations?

The idea of clustering commodity workstations as a cost-effective alternative to expensive Massively Parallel Processors (MPP) has now been explored for quite a long time. Perhaps one of the first approaches was the one in which the idle time of ordinary workstations interconnected in a Local Area Network (LAN) was collected to form a computational resource pool that could be used to support the execution of parallel applications [HCG+82, Che84]. This idea has been around for almost as long as the LAN itself and undoubtedly helped to open the way for cluster computing. Nowadays, however, cluster computing is better associated with a pile of workstations interconnected in a System Area Network (SAN) and completely dedicated to supporting the execution of parallel applications. During this natural evolution towards performance, cluster computing has reshaped the concept of commodity computing by triggering the migration of software and hardware concepts from the supercomputer environment to the desktop. Undoubtedly, cost-effectiveness has been a key factor in boosting the utilization of clusters of workstations to support high-performance computing. However, price/performance is not a decisive metric when the problem to be solved, i.e. the parallel application to be executed, demands performance figures that cannot be supplied by a cost-effective cluster. That these FLOP-famished applications exist, no one questions, but the suggestion that an

MPP is the only way to appease them is increasingly controversial. Recent developments suggest that clusters are going, in the short term, to overcome MPPs also in absolute performance [TFC00, CPL+97]. This optimism is not accidental: it arises from simple market laws that favor large-scale production. Improving the performance of computational systems demands large engineering and manufacturing efforts, which are usually achieved at a very high cost. While MPPs have to share these costs among a few produced units, clusters can share them with the huge workstation market. This phenomenon has been slowing the development of custom hardware, in favor of commodity components, in such a way as to indicate that both technologies, MPPs and clusters, are about to merge. The Intel ASCI Red MPP [San01] is a good example of this merge: its processing elements are ordinary Intel Pentium Pro microprocessors normally found in PCs, and the whole machine has only two non-commodity components in the interconnection system. Although the commodity workstation hardware scene looks bright, the same cannot be stated about commodity software. The run-time support systems running on ordinary workstations have not been designed with parallel computing in mind and usually fail to deliver the functionality needed to execute a parallel application. The commodity solution to this problem is to add a middleware layer, so that features like location transparency, remote invocation, and synchronization are appended to the system. However, this configuration often fails to achieve the performance level demanded by the parallel application [NAS97, MT00]. Indeed, the need for specialized run-time support systems to support parallel computing on clusters of workstations has already been recognized by the fraction of the cluster computing community committed to high performance [ABLL92, Fel92, BRB98].
In this scene, modifications to the operating system kernel, customizations of run-time support libraries, and specialized middleware implementations are common practice, with user-level communication being a frequently explored optimization. This can be observed in projects like U-Net at Cornell University [WBvE96], Fast Messages (FM) at the University of Illinois [PKC97], PM at the Real World Computing Partnership [THIS97], Basic Interface for Parallelism (BIP) at the University of Lyon [PT98], and others. However, run-time support system specializations for cluster computing are usually not customizable with regard to application requirements, focusing instead on improving performance by exploiting particular architectural features. Especially regarding communication, most solutions comprise a single protocol, disregarding the particular characteristics of each application. In contrast, Epos was designed to be customized according to application needs, delivering abstractions that can be adjusted and combined to produce a large variety of run-time systems. For example, the Epos communication subsystem [section 3.4] comprises six families of abstractions and a number of configurable features that can be arranged to deliver applications a tailored communication system.


7.2 The SNOW Cluster

The Snow cluster was assembled in 1997 at the Research Institute for Computer Architecture and Software Engineering (FIRST) of the German National Research Center for Information Technology (GMD). The cluster consists of 24 processing nodes interconnected with a high-speed network and connected to a server through a service network. These are all off-the-shelf components that have been separately acquired and locally assembled by the institute staff. A sketch of the Snow cluster from the perspective of Epos is presented in figure 45. The server acts as a front-end to the cluster, from which parallel applications are started and monitored. While running Epos, processing nodes are entirely managed from the server, which triggers a remote boot procedure to start up a tailored Epos for each application session. The server also provides parallel applications running on processing nodes with a persistent storage facility.

7.2.1 Processing Nodes

Snow processing nodes are diskless workstations, each comprising one or two Intel P6 processors, main memory, and network interface cards (NIC) for the computing (Myrinet) and service (Fast Ethernet) networks. A schematic of a Snow node focusing on hardware components that are significant to Epos is shown in figure 46. Every hardware platform has its highs and lows. When used for a purpose other than the one originally conceived, it is natural that the lows become more noticeable.

Figure 45: The Snow cluster from the perspective of Epos.

Figure 46: A Snow node from the perspective of Epos.

Commodity workstations are not constructed to support parallel computing and, when used for this purpose, often fail to exhibit the properties anticipated by applications. Before beginning the implementation of Epos for the Snow cluster, a series of experiments was performed in order to identify hardware limitations concerning high-performance computing that could be reduced by proper operating system design. These experiments identified two major limitations:

1. Low memory bandwidth: Snow nodes, like most ix86-based workstations, are equipped with low-bandwidth memory subsystems. This deficiency is usually not perceptible to the interactive applications that are traditionally executed on such platforms, thanks to a high-bandwidth cache mechanism that hides main memory latency. Nevertheless, a parallel application performing short computation cycles on large data sets can easily overflow such a mechanism, leading the processor to wait for the memory subsystem.

2. Interconnect on I/O bus: in contrast to MPP nodes, which usually integrate interconnect mechanisms into the processor/memory path, clusters of commodity workstations rely on ordinary network interface cards plugged into the I/O bus (figure 46). As a consequence, the path to the network is lengthened and the overhead on inter-node communication increased. In the particular case of Snow, the data transfer rate between main memory and the Myrinet adapter is lower than between two Myrinet NICs on different nodes.

There is not much an operating system can do to overcome the first limitation. Caring for cache-aligned memory allocation could help, but the operating system cannot enforce cache-aligned memory access (a compiler could). Regarding the second limitation, however, the experiments conducted with Snow helped to design a pipelined communication subsystem that is able to hide a significant part of the latency induced by the hardware architecture. This mechanism will be discussed later in section 7.3.1.

7.2.2 Interconnects

Snow processing nodes are interconnected by two networks: a service network used for interactions with the server, and a computing network used for message exchange between processes of parallel applications. The organization of these networks is depicted in figure 47. The service network consists of a full-switched, full-duplex Fast Ethernet network with a Gigabit Ethernet link to the server. It was configured to give each node an equally wide communication channel to the server. The computing network consists of a Myricom Myrinet high-speed network [BCF+95]. Myrinet has been chosen as the computing network for Snow because its interfaces and protocols are open and published, and its network interface cards are programmable. Indeed, the Myrinet NICs that equip Snow nodes (figure 48) are embedded computing systems, with their own processor and memory. The LANai processor on Myrinet NICs can be programmed to release the main processor from most communication-related tasks. Two architectural characteristics of Myrinet NICs are of special interest for an operating system project:






Figure 47: The organization of Snow networks.






Figure 48: The Myrinet network interface card from the perspective of Epos.

- The memory on the NIC can be shared with the main processor, yielding an asymmetric multiprocessor system;

- The DMA engines used to send and receive data to/from the network and to transfer data between host and NIC can operate in parallel.

The asymmetric multiprocessor configuration enables the network to be modeled as a bounded buffer, with application processes on the host producing

outgoing messages that are consumed by a process on the Myrinet NIC. Incoming messages are handled the other way round, with a Myrinet process playing the role of producer and application processes on the host playing consumers. This asynchronous communication model, in combination with the parallel operation of DMA engines, allows a message (or a piece of a message) to be transferred between host and Myrinet at the same time another one is being transferred over the network, thus sustaining a communication pipeline.
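The bounded-buffer model can be sketched as a ring of message descriptors in the memory shared by host and NIC. The layout below is an assumption for illustration, not the actual Epos/Myrinet data structure, and it omits the volatile qualifiers and memory barriers real shared memory would require.

```cpp
#include <cstddef>

const int RING = 8; // assumed ring size

// What the host writes into NIC-shared memory for each message.
struct Descriptor {
    void * addr = 0;          // message location in host memory
    std::size_t length = 0;   // message length in bytes
    bool ready = false;       // host sets, NIC process clears when consumed
};

struct Ring {
    Descriptor slot[RING];
    int head = 0;             // next slot the host fills
    int tail = 0;             // next slot the NIC process drains

    bool produce(void * a, std::size_t n) {   // host side
        Descriptor & d = slot[head % RING];
        if (d.ready) return false;            // ring full: apply flow control
        d.addr = a;
        d.length = n;
        d.ready = true;
        ++head;
        return true;
    }
    bool consume(Descriptor & out) {          // NIC side
        Descriptor & d = slot[tail % RING];
        if (!d.ready) return false;           // ring empty
        out = d;
        d.ready = false;
        ++tail;
        return true;
    }
};
```

The inverse ring, with the NIC producing and host processes consuming, models message reception.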


7.3 EPOS Implementation

Aiming at validating the domain engineering described in this article, a prototype of Epos was implemented for the Snow cluster. Emphasis was placed on the software engineering techniques introduced by application-oriented system design, such as adaptable abstractions, scenario aspects, inflated interfaces, and component frameworks. Classic operating system concepts, which have already been implemented in numerous other systems, were given a lower priority. In particular, no support for multiprocessor nodes has been included in this prototype. The implementation of the Epos prototype for Snow was conducted as a last refinement of the design, rather than an isolated phase with sporadic meeting points. Constant feedback helped to enhance design specifications while yielding better support for implementation decisions. This scheme, which resembles eXtreme Programming (XP) [Bec99], was only feasible because of the solid domain decomposition guided by application-oriented system design. Cycling through the design to extend or correct the functional specification of individual abstractions, or to include new abstractions, does not have major consequences for already implemented entities. In fact, conducting a design to its finest level independently of implementation experiments is seldom viable. However, a sloppy domain decomposition would have been exposed to deeper modifications, perhaps requiring abstractions and scenario aspects to be repartitioned, thus compromising the software development process. The Epos prototype for Snow was mostly implemented in standard C++, with a few hardware mediator methods written in assembly. Two versions of the prototype have been produced: one runs "natively" on Snow, and the other as a "guest operating system" on Linux. From the point of

view of applications, both versions are functionally equivalent, though the Linux-guest implementation suffers performance impairments due to Linux memory and process management strategies. Specifically, there is no way to disable multitasking and scheduling on Unix-like systems, consequently affecting the execution of dedicated applications that do not need such features. Several of the Epos abstractions described in section 3 have been implemented for Snow, including representatives of all modeled families. Particularly interesting in the context of application-orientation were those minimalist abstractions unavailable in general-purpose operating systems, such as the Flat Address Space, the Exclusive Task, and the Exclusive Thread. The Flat Address Space has a practically empty implementation, since the configuration of the ix86's MMU to simulate a flat address space is performed by the setup utility (completely disabling the MMU of an ix86 processor is not possible, so the setup utility has to arrange for a mapping in which logical and physical addresses match). The abstraction only features operations to assert the status of the memory subsystem and to modify memory-mapping modes (through the methods of the MMU hardware mediator). The Exclusive Task is also mostly realized by the setup utility, which loads the task's code segment in a read-only portion of the address space, assigning the remaining memory to the task's data segment. The Exclusive Thread implementation consists basically of status and termination operations. If the Alarm abstraction is not explicitly instantiated by the application, timer interrupts remain disabled, since processor scheduling is not necessary for this abstraction (time operations performed with Clock and Chronometer do not depend on timer interrupts). Together with Flat Address Space and Exclusive Task, Exclusive Thread is able to produce an Epos instance of size zero: after hardware and system initialization, absolutely all resources are delivered to the application process. The decomposition guided by application-oriented system design resulted in relatively simple implementations even for the most complex Epos abstractions.
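To illustrate how thin such minimalist abstractions can be, consider a hypothetical sketch of a Flat address space: once setup has arranged for logical and physical addresses to match, its operations degenerate into identity mappings that the compiler can inline away. The interface shown is invented for illustration, not the actual Epos one.

```cpp
// Hypothetical sketch of a near-empty "Flat" address space abstraction.
// With logical and physical addresses matching at setup time, attach and
// detach become identity operations, yielding a zero-overhead abstraction.
typedef unsigned long Phy_Addr;
typedef unsigned long Log_Addr;

class Flat_Address_Space {
public:
    Log_Addr attach(Phy_Addr phy) { return phy; } // identity mapping
    void detach(Log_Addr) {}                      // nothing to undo
};
```

An optimizing compiler reduces calls on such an object to nothing, which is how a "size zero" system instance becomes possible.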
For instance, a large number of entities pertaining to the implementation of the Paged Address Space, Mutual Task, and Concurrent Thread abstractions have been separately implemented as hardware mediators and scenario aspects. Implementing the Paged Address Space abstraction without having to consider the idiosyncrasies of the ix86's MMU (and vice versa) was undoubtedly more effective than implementing a monolithic abstraction. Epos scenario aspects [section 4] have been modeled relying on the ability of scenario adapters to autonomously enforce a scenario (i.e. a composition of structural and behavioral scenario aspects) on abstractions, thus eliminating eventual dependencies on the component framework. Consequently, scenario aspects could be implemented as autonomous objects that are aggregated to form a Scenario object. This is subsequently applied to abstractions by the Adapter framework element (see figure 33 in section 5.1). Accomplishing an implementation of the Epos component framework metaprogram [section 5.1] that respects the autonomy of system abstractions and scenario aspects, i.e. that does not require them to incorporate specific framework elements, was the most time-demanding activity during the implementation of the Epos prototype. A precise realization of the design described in section 5.1 as a set of C++ templates could only be accomplished after a long streamlining process. This process exposed many pitfalls of static metaprogramming in C++, especially concerning maintainability (the framework metaprogram is not exposed to end users). Achieving the expected functionality was not problematic, but the first versions of the metaprogram were almost unintelligible. Nevertheless, the resulting framework metaprogram, besides being able to handle independently defined abstractions and scenario aspects, has a null intrinsic overhead and is relatively easy to maintain.

7.3.1 The Myrinet Network Abstraction

A distinguished abstraction implemented for the Snow cluster is the Myrinet member of the Network family. This abstraction was first implemented for the guest version of Epos, focusing on pipelining the communication system [Tie99]. Subsequently, a native version emphasizing aspects of application-oriented system design was developed [FTSP00]. A design diagram of the Myrinet abstraction is depicted in figure 49. Myrinet uses the Myrinet NIC member of the Device family to realize the uniform inflated interface of the Network family. The Myrinet NIC device abstraction deploys mechanisms provided by its family common package to



map the control registers and the memory of a Myrinet NIC into the address space of the main processor. In this way, the state of a Myrinet NIC instance matches the state of the corresponding physical NIC (the setup utility pre-allocates one instance of Myrinet NIC for each Myrinet NIC installed in the node). Myrinet subsequently uses Myrinet NIC to realize a member of the Network family of abstractions, defining methods to send and receive packets of data to peer nodes on the Myrinet system area network.





Figure 49: Epos Myrinet abstraction.

A configurable feature is more than a flag to control the inclusion of an optional family feature; it also designates a generic implementation of the feature that can be reused by family members. With regard to Myrinet, the ordering configurable feature is permanently enabled, since the source routing strategy used by Myrinet implicitly ensures in-order delivery. However, the generic implementations of the configurable features multicast and broadcast would be reused, since Myrinet does not support them by design. Nevertheless, a Myrinet NIC has its own processor and could itself implement the hardware support that is missing to deliver the configurable features defined for the Network family. Therefore, none of the generic implementations has been reused by Myrinet, which relies on processes running on the NIC to realize them. Indeed, most of the functionality of the Myrinet abstraction is delivered by such processes, which perform in an asymmetric multiprocessor scenario with processes on the host. The communication strategy implemented by the Myrinet abstraction consists in writing message descriptors on the memory shared by NIC and


host, designating the address and length of outgoing messages and incoming message buffers, and signaling a process on the NIC to perform a message exchange. Two processes run on the NIC on behalf of Myrinet: sender and receiver.

Figure 50: Activity diagram of Epos Myrinet sender process.

The Myrinet sender process (see the activity diagram in figure 50) waits for a message descriptor to be signaled ready and then autonomously processes the outgoing message. It first starts a DMA transaction to fetch the message from main memory. Meanwhile, it generates a header for the outgoing message based on the information present in the descriptor. When the completion of the DMA transfer is signaled, a new DMA transaction is started to push the message, now with a header, into the network. Finally,

the descriptor is updated to signal that the message has been sent. The Myrinet receiver process operates complementarily to the sender. It is activated by interrupt whenever a message arrives (the LANai processor on Myrinet NICs is a dual-context processor, with context exchange being automatically triggered by interrupts), extracting it from the network with a DMA transaction and storing it in a local buffer. Subsequently, the receiver checks for a message descriptor indicating that a process on the host is waiting for the message. If one is available, it initiates a second DMA transaction to move the message to the main memory location indicated by the descriptor; otherwise the receiver process will perform this step later, when a receive descriptor becomes available. On the imminence of buffer overflow, a flow control mechanism is activated if the flow control configurable feature is enabled; otherwise messages are dropped. This communication strategy was conceived considering the Flat member of the Address Space family [section 3.1], for which logical and physical addresses match, since the LANai processor on the Myrinet NIC does not know of the address translation scheme on the host. If the Paged Address Space is used, or if Epos is running as a guest on Linux, messages can be scattered across the memory in frames that cannot be directly identified from their logical address. Therefore, an intermediate copy to a contiguously allocated system buffer, of which the physical address is known, has to be performed on the host. A message exchange over Myrinet, including the additional copies, is depicted in figure 51. The ability of a Myrinet NIC to simultaneously perform two DMA transfers (each processor cycle comprises two memory cycles) was exploited to transform the steps represented in figure 51 into stages of a communication pipeline. Messages are thus broken into small packets that move through the pipeline in parallel. In this way, it is possible to start sending a message over the network while it is still being fetched from main memory.
When the pipeline is full, stages 2 and 3.1, as well as stages 4 and 3.2, are overlapped. The delay between stages 3.1 and 3.2, i.e. the physical network latency, is much smaller than the time spent by packets on the other pipeline stages, hence both stages can be merged to yield a single stage 3. If the copy stages 1 and 5 are necessary, message descriptors are assigned on a per-packet basis, so the processes on the Myrinet NIC are not forced to wait for the copies to be completed. However, stages 1 and 2, as well as stages 5 and 4, concur for the



Figure 51: A message exchange with Myrinet.

same physical resource, namely memory, and will seldom overlap. Furthermore, the proposed communication pipeline has a critical parameter: the packet size. Adopting an excessively large packet size increases the pipeline's setup time, while excessively small packets exacerbate the pipeline's internal overhead. Moreover, the influence of the packet size on the pipeline depends on the length of the messages being transmitted. In order to estimate the optimal packet size as a function of the message length, the bandwidth of each pipeline stage and the latency experienced by packets on each stage were obtained. Applying a linear cost model, the optimal packet size for different ranges of message lengths was calculated [Tie99]. The result was used to implement an adaptive communication pipeline that automatically switches to the best packet size for each message. Although the communication pipeline has a low intrinsic overhead, programming DMA controllers and synchronizing pipeline stages may be more onerous than using programmed I/O for messages shorter than a certain length. Therefore, the pipeline is bypassed for messages shorter than a definable threshold (256 bytes by default in the current implementation). The performance of the Myrinet Network abstraction implemented for the Snow cluster was assessed by sending messages of increasing length from one node to another. Both the ix86-native (no copy) and the Linux-guest (copy) versions were considered. Figures 52 and 53 show respectively the

bandwidth and the latency observed during the experiment. A bandwidth of 116 Mbytes/s for 64-Kbyte messages represents roughly 90% of the bandwidth available to transfer data between main memory and the Myrinet NIC over the 32-bit/33 MHz PCI bus that equips Snow nodes.
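One plausible form of the linear cost model mentioned above: if every pipeline stage costs a fixed per-packet overhead s plus p/B for a p-byte packet, and an n-byte message crosses m stages as n/p packets, the transfer takes roughly T(p) = (n/p + m - 1)(s + p/B), which is minimized at p* = sqrt(nsB/(m - 1)). The formula and constants are assumptions for illustration, not the measured Snow parameters from [Tie99].

```cpp
#include <cmath>

// T(p) = (n/p + m - 1) * (s + p/B): pipeline fill plus steady state.
// n: message length, p: packet size, s: per-packet stage overhead,
// B: stage bandwidth, m: number of pipeline stages.
double pipeline_time(double n, double p, double s, double B, int m) {
    return (n / p + m - 1) * (s + p / B);
}

// Closed-form minimizer of T(p): p* = sqrt(n * s * B / (m - 1))
double optimal_packet(double n, double s, double B, int m) {
    return std::sqrt(n * s * B / (m - 1));
}
```

Under this model the optimal packet size grows with the square root of the message length, which is consistent with switching packet sizes per message range and bypassing the pipeline entirely below a small threshold.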

Figure 52: One-way bandwidth achieved by the Myrinet Network abstraction on Snow.

Figure 52 makes evident the contention for memory access between pipeline stages 1 (copy) and 2 (host-to-NIC DMA) in the Linux-guest implementation. The ix86-native version, which does not use stage 1, shows a growing advantage over the guest version as the message length increases. Even if large messages are split into small packets, the bursts of memory transactions performed by both stages exceed the memory subsystem capacity. This phenomenon is direct evidence that apparently harmless features of conventional operating systems that are enabled by design (multitasking support in this case) do indeed impact applications that do not need them.



Figure 53: One-way latency measured for the Myrinet Network abstraction on Snow.

7.3.2 Utilities

The Epos setup utility [section 5.2.1] was implemented for Snow taking advantage of the configuration capabilities of the service network. For ease of maintenance, Snow computing nodes do not have disks. Hence, booting is accomplished by each node downloading a boot image from the server over the service network. This boot image includes a node-specific description of Epos, which among other configuration information contains the node's logical id, and the executable image of the first process. In this manner, each node can be initialized with a distinct application (and consequently a distinct configuration of Epos) or with distinct processes of a task-parallel application. The initialization of data-parallel applications in a Single Program, Multiple Data (SPMD) configuration is accomplished with a single boot image that is downloaded from the server by all nodes. Subsequently, the setup utility replaces the in-image description of Epos with a node-specific version obtained from the server using the Dynamic Host Configuration Protocol (DHCP) [Dro93] capabilities of the service network. Therefore, users can select the initialization strategy according to the needs of each application. Both the setup and init utilities expect executable images, including the operating system, to be in the Executable and Linking Format (ELF) [Int95b]. This enables ordinary compilation and linking tools on an ix86-based Linux workstation to be used as a cross-compiling environment for Epos. The separation of the init utility [section 5.3.1] from the operating system image is accomplished using these tools. The init method of every abstraction is defined in a separate compilation unit that is linked with the init utility. Undefined references to Epos addresses are then resolved by the linker consulting (but not linking) the Epos ELF image.

7.3.3 Tools

A simplified version of the Epos syntax analyzer was implemented for Snow. This analyzer is able to identify abstractions that are necessary to support the execution of applications written in C++. It invokes the compiler indicated by the user to compile the application with all inflated interfaces left unbound. If libraries are supplied, incremental linking is performed in order to collect further references to Epos abstractions contained in those libraries. The output of this compilation/linking process is an object file whose symbol table contains undefined references to the Epos abstractions designated by the application (directly or indirectly). The symbol table is subsequently manipulated by the analyzer to produce a list of operations invoked for each Epos abstraction. Abstractions that are directly referenced, i.e. that are not referenced through the inflated interface of their families, are also included in the analysis to substantiate further configuration decisions. For example, an application can directly designate a system abstraction that constrains the binding of other inflated interfaces. Submitting the dining philosophers program presented in section 6 to the syntax analyzer produces the following results:

    Synchronizer {
        constructor(void);
        lock(void);
        unlock(void);
    }
    Thread {
        constructor(int (*)(int), int);
    }

In comparison to a full-fledged Epos analyzer, this implementation misses detailed scope information. Instead of functions, compilation units are used as scope delimiters. In the example above, the information that Synchronizer was instantiated in the global context, and that Thread was instantiated in the context of function main, was lost. Nevertheless, using the compiler to parse applications on behalf of the Epos analyzer brings two important advantages: first, conflicts between analyzer and compiler due to divergent parsing strategies are eliminated; second, the syntactical information output by compilers in object files represents the input program after the execution of eventual static metaprograms (i.e. templates are instantiated, static constant data members and inline member functions are resolved). It would be extremely difficult for an autonomous syntax analyzer to deliver these characteristics, especially with regard to static metaprograms that optimize the input program. An ideal implementation of the Epos analyzer could be obtained with a compiler that outputs the parse tree and the flow graph of programs as they are compiled. Such a compiler would retain the qualities of the current implementation while attaining more detailed scope information. Unfortunately, none of the compilers available for Snow offers this feature. The results of the syntactical analysis are subsequently processed by the Epos configurator. This tool relies on a catalog of composition rules to select the components that best match the requirements collected by the analyzer. This catalog embeds a list of existing Epos abstractions and a description of the target platform that serve as additional constraints for the configuration procedure. In the current implementation, composition rules are represented in the eXtensible Markup Language (XML) [W3C98].
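The analyzer's core step, mapping demangled symbol-table entries to per-family operation lists, can be pictured with a toy parser. The System::Interface qualification scheme follows the listing in section 6; everything else here is invented for illustration.

```cpp
#include <string>

struct Signature { std::string family, operation; };

// Split "System::Interface::Thread::Thread(int (*)(int), int)" into
// family "Thread" and operation "Thread(int (*)(int), int)".
bool parse_symbol(const std::string & sym, Signature & out) {
    const std::string prefix = "System::Interface::";
    if (sym.compare(0, prefix.size(), prefix) != 0) return false;
    std::string rest = sym.substr(prefix.size());
    std::string::size_type sep = rest.find("::");
    if (sep == std::string::npos) return false;
    out.family = rest.substr(0, sep);
    out.operation = rest.substr(sep + 2);
    return true;
}
```

Grouping the resulting signatures by family yields exactly the kind of report shown above, with symbols outside the System::Interface namespace ignored.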
An associated Document Type Definition (DTD) enables a flexible use of the rules catalog by diverse tools. Some fragments of the rules catalog concerning the dining philosophers program analyzed earlier are depicted in figure 54. Families of abstractions are declared with the family element, while family members are declared with the member element in the scope of their families. Attribute default of the family element designates the family member to be used in case the configurator does not have enough arguments to make a selection. The order in which members are declared in a family designates their cost, the first being the cheapest. Setting the type attribute of a member element to "Sole" indicates that the member cannot be used simultaneously with other family members. In the case of Thread, however, the type attribute of the family set to "Incremental" allows the configurator to replace a cheaper member with a more expensive one. For example, if one compilation unit is satisfied with the "Exclusive" member, but another requires "Cooperative", the latter is used for the entire application.

Figure 54: Fragments of Epos catalog of composition rules.

Both family and member elements can be assigned composition rules in the form described in section 5.1.2. Attributes pre and pos are used for this purpose. In figure 54, member "Exclusive" of the "Thread" family depends on the member with the same name from the "Task" family, and family "CPU Scheduler" requires the "Concurrent" member of the "Thread" family.

An interpreter for this configuration language was implemented in the context of a graphical front-end for Epos configuration tools [Röm01]. Besides being able to guide the process of tailoring Epos to a particular application, this visual tool delivers a feature-based interface to manually configure Epos. It shall be extended in the future to accomplish a management console for Snow, from which hardware and software maintenance tasks will be performed.

The interfaces of the abstractions listed in the catalog of composition rules are automatically collected into a catalog of system interfaces by applying the syntax analyzer to the Epos component repository. A stretch of the Epos catalog of system interfaces is depicted in figure 55.
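A catalog fragment in the spirit of figure 54 could look roughly as follows. The element and attribute names (family, member, default, type, pre) follow the description above, but the concrete members and rules shown are assumptions, since the figure itself is not reproduced here.

```xml
<!-- Hypothetical fragment; element and attribute names follow the text,
     concrete members and rules are assumed for illustration. -->
<family name="Synchronizer" default="Mutex">
  <member name="Mutex" type="Sole"/>
  <member name="Semaphore" type="Sole"/>
  <member name="Condition" type="Sole"/>
</family>
<family name="Thread" type="Incremental" default="Exclusive">
  <member name="Exclusive" pre="Task::Exclusive"/>
  <member name="Concurrent"/>
  <member name="Cooperative"/>
</family>
<family name="CPU_Scheduler" pre="Thread::Concurrent">
  <member name="FCFS"/>
</family>
```

Note that "Concurrent" precedes "Cooperative" here, reflecting the reversed declaration order mentioned later in the text.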
The special interface Framework comprises operations that are implicitly supplied by the component framework to all abstractions. This entity is manually inserted in the catalog.

Figure 55: Fragments of Epos catalog of system interfaces.

In order to select the best realization for each inflated interface referenced by the application, the configurator crosses the output of the syntactical analysis with the catalog of interfaces. Realizations with interfaces that contain the required signatures are probed considering the order specified in the catalog of composition rules. When one is found that matches the required signatures, the respective composition rules are checked. If no conflicts are detected, the inflated interface is bound to that realization; otherwise, the search continues. The impossibility of finding an adequate realization reveals either an application flaw, or a member of a dissociated family that has not yet been implemented. Regarding the dining philosophers program, Synchronizer would be bound to the cheapest realization that features a default constructor and includes methods lock and unlock, that is, Mutex. In turn, the Thread inflated interface would be bound to Concurrent, since Exclusive does not provide the required constructor. As explained in section 6, the Cooperative member of the Thread family has a lower overhead than Concurrent, but it is solely elected by method pass, which is not realized by Concurrent (this was arranged by reversing the order of both abstractions in the catalog of composition rules).

A configuration produced by the Epos configurator consists of selective realize keys, which designate the binding of inflated interfaces of abstractions and scenario aspects, and configurable feature keys, which designate the configurable features that must be enabled for abstractions and scenario aspects. This set of keys is translated by the Epos generator into typedefs and Traits structures that control the operation of the component framework metaprogram throughout the compilation of an Epos instance. Indeed, the generator cannot be considered separately from the compiler, since actual generative tasks, such as the adaptation of abstractions to scenarios and the composition of abstractions, are performed by the compiler as it executes the metaprogram.




Epos (Embedded Parallel Operating System) is an experimental application-oriented operating system developed in the scope of this dissertation to validate concepts and techniques of application-oriented system design. The domain envisioned by Epos is that of high-performance dedicated computing, which comprises applications that, besides running with exclusivity on their respective platforms, require an efficient management of resources. This domain comprises mainly embedded and parallel applications.

In order to cope with the steady evolution of the envisioned domain, Epos established an open and continuous domain analysis process that allows new entities to be included in the design as they are identified in the domain. Domain entities were modeled aiming at high scalability and reusability, so that Epos can be tailored to particular applications. Abstractions are mostly independent of each other, of execution scenario aspects, and of component frameworks. Consequently, they can be extensively reused in a variety of scenarios. Furthermore, the Epos component framework can be adjusted to accommodate forthcoming abstractions, or to build particular software architectures, without affecting existing abstractions.

Epos captures the entities in the domain of high-performance dedicated computing with application-oriented abstractions that realize mechanisms concerning the management of memory, processes, coordination, communication, time, and I/O. Besides, a remote object invocation mechanism allows external abstractions, such as files and graphical displays, to be transparently accessed. Entities with strong architectural dependencies, such as node and CPU, were separately modeled as non-portable hardware mediators that realize a portable operating system interface. Epos models the following properties of domain entities as scenario aspects: identification, sharing, allocation, protection, timing, atomicity, remote invocation, debugging, and profiling.
An execution scenario for abstractions is shaped by combining suitable scenario aspects, which are transparently applied to abstractions via scenario adapters. The Epos component framework captures elements of reusable system architectures, defining how abstractions can be composed. It was modeled as a collection of interrelated scenario adapters that build a "socket board" for system abstractions and scenario aspects, which are "plugged" via inflated interface binding. The framework is realized by a static metaprogram and a set of externally defined composition rules. The metaprogram is responsible for adapting abstractions to the selected execution scenario and arranging them together during the compilation of an application-oriented instance of Epos, while the rules coordinate the operation of the metaprogram, specifying constraints and dependencies for the composition of abstractions.

The scalable software architecture accomplished by the component framework is sustained by the setup and init utilities. The former isolates architectural dependencies concerning hardware initialization, while the latter isolates system initialization procedures. In cooperation with the framework metaprogram, these utilities are able to deliver an Epos instance either as a μ-kernel or fully embedded in a single-tasked application.

The Epos design allows for automatic configuration and generation. Abstraction and family (inflated) interfaces referred to by an application can be collected by the Epos analyzer to yield a blueprint for the system that has to be generated. This blueprint is subsequently processed by the Epos configurator, which consults a catalog of composition rules and a list of existing system abstractions to configure the component framework according to application needs. At last, the Epos generator compiles an instance of Epos including the necessary abstractions, mediators, and scenario aspects.

Clustering commodity workstations as a cost-effective alternative to expensive massively parallel processors has been explored for quite a long time. Recently, improvements in commodity microprocessors and interconnects began to make clusters competitive also in absolute performance, suggesting that both hardware technologies are about to merge. Regarding software, however, the dedicated run-time support systems used on MPPs show considerable advantages over the commodity workstation operating systems typically used on clusters.
In order to compensate for the difference, commodity systems are often patched with specialized subsystems, in particular user-level communication. However, the inflexible structure of such operating systems severely restricts optimizations. The prototype implementation of Epos, which aims firstly at verifying design decisions, was strongly motivated by the possibility of introducing an application-oriented operating system to the high-performance cluster computing scene, hence the decision of implementing it for the Snow cluster of workstations. This cluster consists of 24 Intel ix86-based processing nodes interconnected by a Myrinet high-speed network and connected to a server through a Fast Ethernet service network.

The implementation of Epos for Snow was conducted as a last refinement of the design, rather than an isolated phase. Constant feedback helped to enhance design specifications while yielding better support for implementation decisions. Abstractions, hardware mediators, scenario aspects, the component framework, the setup and init utilities, and the configuration tools were implemented to produce two versions of the prototype: one that runs "natively" on Snow, and another that runs as a "guest operating system" on Linux.

The implementation of the Myrinet Network abstraction benefited from the LANai processor on the Myrinet interface card to shape an asymmetric multiprocessor configuration with the main processor, enabling the network to be modeled as a bounded buffer. In order to communicate, application processes write message descriptors to the memory shared by both processors. Messages are then autonomously processed by the portion of the Myrinet Network abstraction that runs on the NIC, which profits from the parallel operation of the Myrinet hardware components to implement a communication pipeline.

The syntax analyzer implemented for Snow is able to identify the Epos abstractions that are needed to support the execution of applications written in C++. It uses the compiler indicated by the user to compile the application with all inflated interfaces left unbound, thus producing a symbol table with undefined references to Epos abstractions. The symbol table is subsequently processed to produce a list of the operations invoked on each abstraction. The catalog of composition rules used by the Epos configurator was written in XML, with an associated DTD supporting its flexible use by diverse tools. By consulting this catalog, which embeds the relative cost of abstractions, the configurator is able to select the components that best match the requirements collected by the analyzer.
The output of this configuration process is a set of selective realize keys and configurable feature keys, which are subsequently translated into typedefs and Traits structures by the generator. The fact that the Epos generator does not manipulate the source code of abstractions and scenario aspects directly (the component framework metaprogram does it on behalf of the generator) allowed this simple implementation. Nonetheless, working together, the generator and the framework are able to adapt and compose the selected abstractions to produce application-oriented instances of Epos.

[ABLL92] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler Activations: Effective Kernel Support for the User-level Management of Parallelism. ACM Transactions on Computer Systems, 10(1):53–79, February 1992.

[Bac87] Maurice J. Bach. The Design of the UNIX Operating System. Prentice-Hall, 1987.

[BCF+95] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet: A Gigabit-per-Second Local-Area Network. IEEE Micro, 15(1):29–36, February 1995.

[Bec99] Kent Beck. Extreme Programming Explained: Embrace Change. Addison-Wesley, 1999.

[BN84] Andrew D. Birrell and Bruce Jay Nelson. Implementing Remote Procedure Calls. ACM Transactions on Computer Systems, 2(1):39–59, February 1984.

[Boo94] Grady Booch. Object-Oriented Analysis and Design with Applications. Addison-Wesley, 2nd edition, 1994.

[BRB98] Raoul A. F. Bhoedjang, Tim Rühl, and Henri E. Bal. User-level Network Interface Protocols. IEEE Computer, 31(11):53–60, November 1998.

[Cam72] Luís Vaz de Camões. Os Lusíadas. Imprensa Regia, Lisbon, Portugal, 1572.

[Che84] David R. Cheriton. The V Kernel: A Software Base for Distributed Systems. IEEE Software, 1(2):19–43, April 1984.

[CPL+97] A. Chien, S. Pakin, M. Lauria, M. Buchanan, K. Hane, L. Giannini, and J. Prusakova. High Performance Virtual Machines (HPVM): Clusters with Supercomputing APIs and Performance. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, 1997.

[Den65] Jack B. Dennis. Segmentation and the Design of Multiprogrammed Computer Systems. Journal of the ACM, 12(4):589–602, October 1965.

[Dij65] Edsger Wybe Dijkstra. Cooperating Sequential Processes. Technical Report EWD-123, Technical University of Eindhoven, Eindhoven, The Netherlands, 1965.

[Dij71] Edsger Wybe Dijkstra. Hierarchical Ordering of Sequential Processes. In Operating Systems Techniques, pages 72–93. Academic Press, 1971.

[Dro93] Ralph Droms. Dynamic Host Configuration Protocol. Bucknell University, Lewisburg, U.S.A., October 1993. RFC 1531.

[FAPS96] Antônio Augusto Fröhlich, Rafael B. Avila, Luciano Piccoli, and Helder Savietto. A Concurrent Programming Environment for the i486. In Proceedings of the 5th International Conference on Information Systems Analysis and Synthesis, Orlando, USA, July 1996.

[Fel92] E. W. Felten. The Case for Application-specific Communication Protocols. In Proceedings of Intel Supercomputer Systems Technology Focus Conference, pages 171–181, April 1992.

[Frö01] Antônio Augusto Fröhlich. Application-Oriented Operating Systems. Number 17 in GMD Research Series. GMD - Forschungszentrum Informationstechnik, Sankt Augustin, August 2001.

[FTS00] Antônio Augusto Fröhlich, Gilles Pokam Tientcheu, and Wolfgang Schröder-Preikschat. EPOS and Myrinet: Effective Communication Support for Parallel Applications Running on Clusters of Commodity Workstations. In Proceedings of the 8th International Conference on High Performance Computing and Networking, pages 417–426, Amsterdam, The Netherlands, May 2000.

[HCG+82] K. Hwang, W. J. Croft, G. H. Goble, B. W. Wah, F. A. Briggs, W. R. Simmons, and C. L. Coates. A Unix-based Local Computer Network with Load Balancing. IEEE Computer, 15(4):55–66, April 1982.

[HK61] David J. Howarth and Tom Kilburn. The Manchester University Atlas Operating System, Part II: User's Description. Computer Journal, 4(3):226–229, October 1961.

[Hoa74] Charles Anthony Richard Hoare. Monitors: An Operating System Structuring Concept. Communications of the ACM, 17(10):549–557, October 1974.

[Int95a] Intel. Pentium Pro Family Developer's Manual, December 1995.

[Int95b] Intel. Tool Interface Standard: Executable and Linking Format Specification (version 1.2), May 1995.

[ISO81] International Organization for Standardization. Open Systems Interconnection - Basic Reference Model, August 1981. ISO/TC 97/SC 16 N 719.

[JS80] Anita K. Jones and Peter M. Schwarz. Experience Using Multiprocessor Systems - A Status Report. ACM Computing Surveys, 12(2):121–165, June 1980.

[KHPS61] Tom Kilburn, David J. Howarth, R. B. Payne, and Frank H. Sumner. The Manchester University Atlas Operating System, Part I: Internal Organization. Computer Journal, 4(3):222–225, October 1961.

[Lie96] Jochen Liedtke. On Microkernel Construction. Operating Systems Review, 29(5):237–250, January 1996.

[LT91] Henry M. Levy and Ewan D. Tempero. Modules, Objects, and Distributed Programming: Issues in RPC and Remote Object Invocation. Software - Practice and Experience, 21(1):77–90, January 1991.

[MT86] Sape J. Mullender and Andrew S. Tanenbaum. The Design of a Capability-based Distributed Operating System. The Computer Journal, 29(4):289–300, 1986.

[MT00] University of Mannheim and University of Tennessee. TOP500 Supercomputer List, online edition, November 2000. [].

[NAS97] Numerical Aerospace Simulation Systems Division. The NAS Parallel Benchmarks, online edition, November 1997. [].

[Org72] Elliott Organick. The Multics System: an Examination of its Structure. MIT Press, Cambridge, U.S.A., 1972.

[PKC97] Scott Pakin, Vijay Karamcheti, and Andrew A. Chien. Fast Messages: Efficient, Portable Communication for Workstation Clusters and Massively-Parallel Processors. IEEE Concurrency, 5(2), June 1997.

[PT98] Loic Prylli and Bernard Tourancheau. BIP: a New Protocol Designed for High Performance Networking on Myrinet. In Proceedings of the International Workshop on Personal Computer based Networks of Workstations, Orlando, USA, April 1998.

[Röm01] Sascha Römke. Ein XML-basiertes Konfigurationswerkzeug für Betriebssysteme am Beispiel EPOS. Studienarbeit, Otto-von-Guericke-Universität, Magdeburg, Germany, 2001.

[San01] Sandia National Laboratories. The ASCI Red SuperComputer, online edition, March 2001. [].

[SGP98] Abraham Silberschatz, Peter Galvin, and James Peterson. Operating Systems Concepts. John Wiley and Sons, 5th edition, 1998.

[Sha48] Claude Elwood Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 27:379–423 and 623–656, July 1948.

[SP94a] Wolfgang Schröder-Preikschat. PEACE - A Software Backplane for Parallel Computing. Parallel Computing, 20(10):1471–1485, 1994.

[SP94b] Wolfgang Schröder-Preikschat. The Logical Design of Parallel Operating Systems. Prentice-Hall, Englewood Cliffs, U.S.A., 1994.

[Tan92] Andrew S. Tanenbaum. Modern Operating Systems. Prentice-Hall, 1992.

[TFC00] IEEE Task Force on Cluster Computing. Cluster Computing White Paper, Mark Baker, editor, online edition, December 2000. [].

[TG89] Andrew Tucker and Anoop Gupta. Process Control and Scheduling Issues for Multiprogrammed Shared-Memory Multiprocessors. Operating Systems Review, 23(5):159–166, December 1989.

[THIS97] Hiroshi Tezuka, Atsushi Hori, Yutaka Ishikawa, and Mitsuhisa Sato. PM: An Operating System Coordinated High Performance Communication Library. In High-Performance Computing and Networking, volume 1225 of Lecture Notes in Computer Science, pages 708–717. Springer, April 1997.

[Tho78] Ken Thompson. UNIX Implementation. The Bell System Technical Journal, 57(6):1932–1946, July 1978.

[Tie99] Gilles Pokam Tientcheu. Communication Strategies to Support Medium/Fine Grain Parallelism on SMP Workstations Clustered by High Speed Networks. Diplomarbeit, Technical University of Berlin, Berlin, Germany, December 1999.

[TTG93] Josep Torrellas, Andrew Tucker, and Anoop Gupta. Benefits of Cache-Affinity Scheduling in Shared-Memory Multiprocessors: a Summary. ACM Performance Evaluation Review, 21(1):272–274, June 1993.

[vECGS92] Thorsten von Eicken, David E. Culler, S. C. Goldstein, and Klaus Erik Schauser. Active Messages: a Mechanism for Integrated Communication and Computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 256–266, Gold Coast, Australia, May 1992.

[W3C98] World Wide Web Consortium. XML 1.0 Recommendation, online edition, February 1998. [].

[WBvE96] Matt Welsh, Anindya Basu, and Thorsten von Eicken. Low-latency Communication over Fast Ethernet. In Euro-Par 1996, volume 1123 of Lecture Notes in Computer Science, pages 187–194, Lyon, France, August 1996. Springer.

[ZRS87] Wei Zhao, Krithi Ramamritham, and John A. Stankovic. Scheduling Tasks with Resource Requirements in Hard Real-Time Systems. IEEE Transactions on Software Engineering, 13(5):564–577, 1987.
