3231note


UNSW COMP3231 23T1 notes

Lec 02 Processes and Threads

Processes:

• Also called a task or job

• Memory image of an individual program

• “Owner” of resources allocated for program execution

• Encompasses one or more threads

Threads:

• Unit of execution (the unit that executes instructions)

• Can be traced

• list the sequence of instructions that execute

• Belongs to a process

• Executes within it.

One Process may contain one or more threads.

Process Termination

Conditions which terminate processes

1. Normal exit (voluntary)

2. Error exit (voluntary)

3. Fatal error (involuntary)

4. Killed by another process (involuntary)

Process and Thread States

Running

Blocked

Ready

Running → Ready

• Voluntary Yield()

• End of timeslice

Running → Blocked

• Waiting for input

• File, network,

• Waiting for a timer (alarm signal)

• Waiting for a resource to become available

Scheduler

Sometimes called the dispatcher; it chooses a ready process to run.

BUT: it is inefficient to search through all processes, so processes are kept on per-state queues

e.g. a Ready Queue & a Blocked Queue

Thread Model

Per-process items

• Address space

• Global variables

• Open files

• Child processes

• Pending alarms

• Signals and signal handlers

• Accounting information

Per-thread items

• Program counter

• Registers

• Stack (each thread has its own stack)

• State

Local variables are per thread (allocated on the stack)

Global variables are shared between all threads (Allocated in the data section)

Concurrency control is an issue.

Dynamically allocated memory can be global or local -> Program defined

Thread Usage

Model — Characteristics
Threads — Parallelism, blocking system calls
Single-threaded Process — No parallelism, blocking system calls
Finite-state Machine — Parallelism, nonblocking system calls, interrupts

Summary in Threads

  1. Simpler to program than a state machine
  2. Less resources associated with threads than multiple complete processes
    1. Cheaper to create and destroy
    2. Share resources (especially memory) between them
  3. Performance: Threads waiting for IO can be overlapped with computing threads
    1. If all threads are compute bound (limited by computation speed), then there is no performance improvement (on a uniprocessor)
  4. Threads can take advantage of the parallelism available on machines with more than one CPU (multiprocessor)

Lec 03 Concurrency and Synchronisation

Concurrency Example:

Trying to modify a global variable in two threads
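
A minimal sketch of that race (using POSIX threads; not from the lecture itself): both threads perform a non-atomic read-modify-write on the same global, so increments can be lost and the final count is usually below 2,000,000.

    #include <pthread.h>
    #include <stdio.h>

    int counter = 0;                     /* shared global variable */

    static void *worker(void *arg)
    {
        for (int i = 0; i < 1000000; i++)
            counter++;                   /* read-modify-write: not atomic */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %d\n", counter);   /* often < 2000000 */
        return 0;
    }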

Reasons why there is in-kernel concurrency even for single-threaded processes:

  1. Multi-tasking
  2. Interrupt handling
  3. Device drivers
  4. System calls
  5. Resource sharing

Critical Region and how to identify

A critical region is a region of code where shared resources are accessed.

We can control access to the shared resource by controlling access to the code that accesses the resource.

Uncoordinated entry to the critical region results in a race condition (the outcome depends on the order of the concurrent accesses)

HOW TO IDENTIFY:

Critical regions are the regions of code that

  1. access a shared resource,
  2. and correctness relies on the shared resource not being concurrently modified by another thread/process/entity.

HOW TO PREVENT:

Critical Regions Solutions

Conditions required of any solution to the critical region problem:

  1. Mutual exclusion:
    1. No two processes/threads in the critical region at the same time
  2. No assumptions made about speeds or numbers of CPUs
  3. No process running outside its critical region may block another process
  4. No process waits forever to enter its critical region (bounded waiting)

Mutual exclusion by taking turns

Strict alternation: each process takes turns entering the critical region.

Cons

  1. Busy waiting
  2. Process must wait its turn even while the other process is doing something else
    1. With multiple processes, each must wait for everyone to have a turn
      1. Does not guarantee progress if a process no longer needs a turn
    2. Poor solution when processes require the critical section at different rates

Mutual Exclusion by disabling interrupts

How it works:

Before entering a critical region, disable interrupts.

After leaving the critical region, enable interrupts.

Cons:

  1. Only available in the kernel
  2. Delays everybody else, even with no contention
    1. Slows the interrupt response time
  3. Does not work on a multiprocessor

Hardware Support for mutual exclusion

Test and Set instruction

• Can be used to implement lock variables correctly

• Load the value of the lock

• If lock == 0

• Set the lock to 1

• Return the result 0 -> we acquired the lock

• If lock == 1

• Return 1 -> another thread/process has the lock

• Hardware guarantees that the instruction executes atomically -> it cannot be interrupted
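
A hedged sketch of a spinlock built on top of an atomic test-and-set. The GCC __sync_lock_test_and_set builtin stands in for the hardware instruction here; OS/161 itself uses its own assembly-level primitives.

    volatile int lock = 0;               /* 0 = free, 1 = held */

    void acquire(volatile int *l)
    {
        /* Atomically write 1 into *l and return the previous value.
           Old value 0 -> we acquired the lock.
           Old value 1 -> someone else holds it, so keep spinning. */
        while (__sync_lock_test_and_set(l, 1) == 1)
            ;                            /* busy wait (spin) */
    }

    void release(volatile int *l)
    {
        __sync_lock_release(l);          /* atomically sets *l back to 0 */
    }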

Pros:

  1. Simple (easy to verify that it works)
  2. Available at user level
    1. Works with any number of processors
    2. Can implement any number of lock variables

Cons:

  1. Busy waits (also known as a spin lock)
    1. Consumes CPU
    2. Starvation is possible when a process leaves its critical section and more than one process is waiting.

Tackling the busy-wait problem:

Sleep/Wakeup

  1. When the process is waiting for an event, it calls sleep to block, instead of busy waiting.
  2. When the event happens, the event generator (another process) calls wakeup to unblock the sleeping process.
  3. Waking a ready/running process has no effect.

Producer-Consumer Problem (bounded-buffer problem)

Description: the producer produces data items and stores them in a buffer.

Consumer takes the items out of the buffer and consumes them.

Issues:

Producer

Should sleep when the buffer is full.

And wakeup when there is empty space in the buffer.

The consumer can call wakeup when it consumes the first entry of the full buffer.

Consumer

Should sleep when the buffer is empty.

And wakeup when there are items available.

Producer can call wakeup when it adds the first item to the buffer.

Semaphores

Two primitives that are more powerful than simple sleep and wakeup alone.

• P(): proberen, from Dutch to test.

• V(): verhogen, from Dutch to increment.

• Also called wait & signal, down & up.

How it works:

If a resource is not available, the corresponding semaphore blocks any process waiting for the resource.

Blocked processes are put into a process queue maintained by the semaphore (avoids busy waiting!)

When a process releases a resource, it signals this by means of the semaphore.

Signaling resumes a blocked process if there is any.

Wait (P) and signal (V) operations cannot be interrupted.

Complex coordination can be implemented by multiple semaphores.

The initial count determines how many waits can progress before blocking and requiring a signal.

Producer Consumer problem with semaphores:
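
The notes only name the problem here; below is one possible sketch using POSIX semaphores (the OS/161 semaphore interface is analogous). The buffer size N, slot indices and the start-up sem_init values are illustrative assumptions.

    #include <semaphore.h>

    #define N 16

    int buf[N];
    int in = 0, out = 0;

    sem_t empty;    /* counts empty slots:  sem_init(&empty, 0, N) */
    sem_t full;     /* counts filled slots: sem_init(&full,  0, 0) */
    sem_t mutex;    /* protects the buffer: sem_init(&mutex, 0, 1) */

    void produce(int item)
    {
        sem_wait(&empty);                /* block while the buffer is full  */
        sem_wait(&mutex);
        buf[in] = item;
        in = (in + 1) % N;
        sem_post(&mutex);
        sem_post(&full);                 /* signal: one more item available */
    }

    int consume(void)
    {
        sem_wait(&full);                 /* block while the buffer is empty */
        sem_wait(&mutex);
        int item = buf[out];
        out = (out + 1) % N;
        sem_post(&mutex);
        sem_post(&empty);                /* signal: one more empty slot     */
        return item;
    }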

Summarising Semaphores

Semaphores can be used to solve a variety of concurrency problems.

However, programming with them can be error-prone:

E.g. must signal for every wait

Too many or too few signals or waits, or signals and waits in the wrong order, can have bad results.

Monitor

A higher level synchronisation primitive.

Programming language construct

IDEA:

A set of procedures/variables/data types are grouped in a special module: a monitor.

Variables and data types can only be accessed from within the monitor.

Only one process/thread can be in the monitor at any one time.

Mutual exclusion is implemented by the compiler.

When a thread calls a monitor procedure that has a thread already inside, it will be in the entry queue and will sleep until the current thread exits the monitor.

Condition Variable

To allow a process to wait within the monitor, we need to declare a condition variable.

Condition variable can only be used with the operations wait and signal.

x.wait()

Means that the process invoking this operation is suspended until another process invokes x.signal().

Another thread can enter the monitor while the original is suspended.

x.signal()

This operation resumes exactly one suspended process.

If no process is suspended, then this operation has no effect.

The producer-consumer solution with monitors is on slide 50 of lec03.

Synchronisation Primitives in OS/161

Locks

Semaphores

Wait (P) / signal (V)

Condition Variables

Note: all three condition variable operations (wait, signal, broadcast) must be called while holding the lock passed in.

Condition Variables and Bounded Buffers

cv_wait() releases the lock before blocking the process (and re-acquires it before returning)

Producer-consumer solution with condition variable
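
A sketch of the bounded buffer with a lock and two condition variables; POSIX pthread_mutex/pthread_cond stand in here for OS/161's lock and cv, which behave analogously (the wait drops the lock while blocked and re-acquires it before returning). The while loops re-check the condition after waking.

    #include <pthread.h>

    #define N 16

    int buf[N];
    int count = 0, in = 0, out = 0;

    pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
    pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

    void produce(int item)
    {
        pthread_mutex_lock(&lock);
        while (count == N)                        /* re-check after waking   */
            pthread_cond_wait(&not_full, &lock);  /* drops lock while asleep */
        buf[in] = item;
        in = (in + 1) % N;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }

    int consume(void)
    {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &lock);
        int item = buf[out];
        out = (out + 1) % N;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        return item;
    }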

Dining Philosophers

The dining philosophers problem is a classic computer-science synchronisation problem, proposed by Edsger Dijkstra in 1965. Five philosophers sit around a circular table, alternating between thinking and eating. A bowl of spaghetti sits in the middle of the table, and there is one chopstick between each pair of philosophers. Each philosopher needs two chopsticks to eat. The philosophers cannot talk to each other; they can only think or eat.

The problem is to design a protocol so that every philosopher can use two chopsticks to eat when needed, without deadlock (all philosophers waiting for chopsticks, so none can continue eating) and without starvation (some philosopher unable to obtain chopsticks for a long time).

One possible solution:

Number the chopsticks (e.g. 1 to 5) and number the philosophers. When a philosopher wants to eat, they first try to pick up the lower-numbered chopstick, then the higher-numbered one. Once a philosopher holds both chopsticks they may start eating. After eating, they put the chopsticks back and continue thinking.

This approach avoids deadlock, because at least one philosopher (the one able to take the lowest-numbered chopstick) can pick up both chopsticks and start eating, while the others must wait for chopsticks to be put back. Although some philosophers may wait longer than others, every philosopher eventually gets a chance to eat, which avoids starvation.
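
A sketch of the ordering solution just described, with pthread mutexes standing in for the chopsticks (names and set-up are illustrative). Each philosopher always picks up the lower-numbered chopstick first, which imposes a global order on the resources and breaks the circular wait.

    #include <pthread.h>

    #define NPHIL 5

    /* one mutex per chopstick; each initialised with pthread_mutex_init() */
    pthread_mutex_t chopstick[NPHIL];

    void philosopher(int i)
    {
        int left   = i;
        int right  = (i + 1) % NPHIL;
        int first  = (left < right) ? left  : right;  /* lower number first */
        int second = (left < right) ? right : left;

        for (;;) {
            /* think(); */
            pthread_mutex_lock(&chopstick[first]);
            pthread_mutex_lock(&chopstick[second]);
            /* eat(); */
            pthread_mutex_unlock(&chopstick[second]);
            pthread_mutex_unlock(&chopstick[first]);
        }
    }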

Reader and Writer Problem

The readers-writers problem is a classic concurrency-control problem describing a scenario in which multiple processes need synchronised access to a shared resource. There are reader processes and writer processes: readers only read the shared resource, while writers may modify it. The challenge is to design a synchronisation protocol that allows multiple readers to access the shared resource simultaneously, but ensures that while a writer is accessing the resource, no other process (reader or writer) can access it.

One possible solution:

Use a mutex and a semaphore. The mutex protects a counter of active readers, and the semaphore controls access to the shared resource.

Reader process:

  1. Acquire the mutex in order to modify the reader count.
  2. Increment the reader count. If this is the first reader, wait on the semaphore to block writers from the shared resource.
  3. Release the mutex.
  4. Access the shared resource.
  5. Acquire the mutex in order to modify the reader count.
  6. Decrement the reader count. If this is the last reader, signal the semaphore to allow writers to access the shared resource.
  7. Release the mutex.

Writer process:

  1. Wait on the semaphore to block other processes from the shared resource.
  2. Access the shared resource.
  3. Signal the semaphore to allow other processes to access the shared resource.

This solution allows multiple readers to access the shared resource simultaneously, but when a writer needs access it must wait until all current readers have finished. Likewise, while a writer is accessing the shared resource, all other readers and writers are blocked. This ensures that no other process can access the shared resource while a writer is using it.
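
A sketch of the reader/writer protocol above (a readers-preference variant), using a pthread mutex for the reader count and a POSIX semaphore guarding the resource; the names are illustrative.

    #include <pthread.h>
    #include <semaphore.h>

    int readcount = 0;                        /* number of active readers */
    pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
    sem_t resource;                           /* sem_init(&resource, 0, 1) at startup */

    void reader(void)
    {
        pthread_mutex_lock(&count_lock);
        readcount++;
        if (readcount == 1)                   /* first reader locks out writers */
            sem_wait(&resource);
        pthread_mutex_unlock(&count_lock);

        /* ... read the shared data ... */

        pthread_mutex_lock(&count_lock);
        readcount--;
        if (readcount == 0)                   /* last reader lets writers in */
            sem_post(&resource);
        pthread_mutex_unlock(&count_lock);
    }

    void writer(void)
    {
        sem_wait(&resource);                  /* exclusive access */
        /* ... modify the shared data ... */
        sem_post(&resource);
    }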

Lec04 DeadLock

DeadLock & Resources

Deadlocks occur when

  1. Processes are granted exclusive access to devices/locks/tables, etc.
  2. We refer to these entities generally as resources.

Deadlock: definition

A set of processes is deadlocked if each process in the set is waiting for an event that only another process in the set can cause.

In the deadlock situation, no process can

  1. Run
  2. Release resources
  3. Be awakened

Four conditions for deadlock

  1. Mutual exclusion condition
    1. Each resource assigned to 1 process or is available
  2. Hold and wait condition
    1. Process holding resources can request additional
  3. No preemption condition
    1. Previously granted resources cannot be forcibly taken away
  4. Circular wait condition
    1. Must be a circular chain of 2 or more processes
    2. Each is waiting for resource held by next member of the chain

Approaches to dealing with deadlock

  1. Just ignore the problem altogether
  2. Prevention
    1. Negating one of the four necessary conditions
  3. Detection and recovery
  4. Dynamic avoidance
    1. Careful resource allocation

Method 1: Ostrich Algorithm

Pretend there is no problem

Reasonable if

  1. Deadlocks occur very rarely
  2. Cost of prevention is high

UNIX and Windows take this approach for some of the more complex resource relationships they manage

It’s a trade off between

  1. Convenience (engineering approach)
  2. Correctness (mathematical approach)

Method 2: Deadlock prevention

Resource allocation rules prevent deadlock by preventing one of the four conditions required for deadlock from occurring:

  1. Mutual exclusion
  2. Hold and wait
  3. No preemption
  4. Circular Wait
Attacking the Mutual Exclusion Condition (Not feasible)

Not feasible in general

Some devices/resources are intrinsically not sharable.

Attacking the hold and wait condition (request resources initially)

Require processes to request resources before starting

A process never has to wait for what it needs

Issues:

  1. May not know the required resources at the start of the run, which means this is not always possible
  2. Also ties up resources other processes could be using

Variations:

  1. Process must give up all resources if it would block holding a resource
  2. Then request all immediately needed
  3. Prone to livelock
Attacking the No Preemption condition (take resources away)

Not a viable option

Attacking the circular wait condition (Order resources)

Numerically ordered resources (Resources ordering is a common technique)

Method 3: Detection and recovery

Need a method to determine if a system is deadlocked.

Assuming deadlocked is detected, we need a method of recovery to restore progress to the system.

Strategy:

Note the resource ownership and requests

If a cycle can be found within the graph, it denotes a deadlock

Detection with Multiple Resources of Each Type

Sum of current resource allocation + resources available = resources that exist

Algorithm (C is the current allocation matrix, R the request matrix, A the vector of available resources)
  1. Look for an unmarked process Pi for which the i-th row of R is less than or equal to A.
  2. If found, add the i-th row of C to A, mark Pi, then go back to step 1.
  3. If no such process exists, terminate.

The remaining processes are deadlocked.
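
A sketch of the detection algorithm above, under the stated matrix conventions (C = current allocation, R = outstanding requests, A = available). A process whose outstanding requests fit within A is assumed able to run to completion, so its allocation is returned to A; anything left unmarked at the end is deadlocked. The sizes are illustrative.

    #include <stdbool.h>

    #define NPROC 3
    #define NRES  4

    /* C[i][j]: units of resource j held by process i
       R[i][j]: units of resource j still requested by process i
       A[j]   : units of resource j currently available
       finished[] must be initialised to all false by the caller.  */
    bool deadlocked(int C[NPROC][NRES], int R[NPROC][NRES], int A[NRES],
                    bool finished[NPROC])
    {
        bool progress = true;
        while (progress) {
            progress = false;
            for (int i = 0; i < NPROC; i++) {
                if (finished[i])
                    continue;
                bool can_run = true;
                for (int j = 0; j < NRES; j++)
                    if (R[i][j] > A[j]) { can_run = false; break; }
                if (can_run) {
                    /* assume Pi runs to completion and releases everything */
                    for (int j = 0; j < NRES; j++)
                        A[j] += C[i][j];
                    finished[i] = true;    /* mark Pi */
                    progress = true;
                }
            }
        }
        for (int i = 0; i < NPROC; i++)    /* unmarked => deadlocked */
            if (!finished[i])
                return true;
        return false;
    }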

Recovery from Deadlock

Recovery through preemption

• take a resource from some other process

• depends on nature of the resource

Recovery through rollback

• checkpoint a process periodically

• use this saved state

• restart the process if it is found deadlocked

• No guarantee is won’t deadlock again

Recovery through killing processes

• crudest but simplest way to break a deadlock

• kill one of the processes in the deadlock cycle

• the other processes get its resources

• choose process that can be rerun from the beginning

Method 4: Deadlock Avoidance

Only if enough information is available in advance can we avoid deadlock:

  1. The maximum number of each resource required
Safe and Unsafe states

A state is safe if

• The system is not deadlocked

• There exists a scheduling order that results in every process running to completion, even if they all request their maximum resources immediately

Unsafe states are not necessarily deadlocked

• With a lucky sequence, all processes may complete

• However, we cannot guarantee that they will complete (not deadlock)

Safe states guarantee we will eventually complete all processes

• Deadlock avoidance algorithm

• Only grant requests that result in safe states

LiveLock

Livelocked processes are not blocked, change state regularly, but never make progress

Bankers Algorithm

Modelled on a Banker with Customers

• The banker has a limited amount of money to loan customers

• Limited number of resources

• Each customer can borrow money up to the customer’s credit limit

• Maximum number of resources required

Basic Idea

• Keep the bank in a safe state

• So all customers are happy even if they all request to borrow up to their credit limit at the same time.

• Customers wishing to borrow such that the bank would enter an unsafe state must wait until somebody else repays their loan such that the transaction becomes safe.

Starvation

A process never receives the resource it is waiting for; despite the resource (repeatedly) becoming free, it is always allocated to another waiting process.

One solution: first come, first served

Lec05 Processes and Threads Implementation

MIPS Registers

User-mode accessible registers

• 32 general purpose registers

• r0 hardwired to zero

• r31 the link register for the jump-and-link (JAL) instruction

HI/LO

• 2 * 32-bits for multiply and divide

PC

• Not directly visible

• Modified implicitly by jump and branch instructions

Branching and Jumping

Branching and jumping have a branch delay slot

The instruction following a branch or jump is always executed prior to the destination of the jump

Process

Memory allocation

Minimally consist of three segments

• Text

• contains the code (instructions)

• Data

• Global variables

• Stack

• Activation records of procedure/function/method

• Local variables

Note:

• data can dynamically grow up

• E.g., malloc()-ing

• The stack can dynamically grow down

• E.g., increasing function call depth or recursion

P23 lec05

Mode distinguish

User-mode

• Processes (programs) scheduled by the kernel

• Isolated from each other

No concurrency issues between each other

System-calls transition into and return from the kernel

Kernel-mode

• Nearly all activities still associated with a process

• Kernel memory shared between all processes

• Concurrency issues exist between processes concurrently executing in a system call

User-level Threads

Implementation at user-level

• User-level Thread Control Block (TCB), ready queue, blocked queue, and dispatcher

• Kernel has no knowledge of the threads (it only sees a single process)

• If a thread blocks waiting for a resource held by another thread inside the same process, its state is saved and the dispatcher switches to another ready thread

• Thread management (create, exit, yield, wait) are implemented in a runtime support library

Pros

• Thread management and switching at user level is much faster than doing it in kernel level

• No need to trap (take syscall exception) into kernel and back to switch

• Dispatcher algorithm can be tuned to the application

• E.g. use priorities

• Can be implemented on any OS (thread or non-thread aware)

• Can easily support massive numbers of threads on a per-application basis

• Use normal application virtual memory

• Kernel memory more constrained. Difficult to efficiently support wildly differing numbers of threads for different applications

Cons

• Threads have to yield() manually (no timer interrupt delivery to user level)

Co-operative multithreading

• A single poorly designed/implemented thread can monopolise the available CPU time

• There are workarounds (e.g. a timer signal per second to enable pre-emptive multithreading), but they are coarse-grained and a kludge.

• Does not take advantage of multiple CPUs (in reality, we still have a single threaded process as far as the kernel is concerned)

• If a thread makes a blocking system call (or takes a page fault), the process (and all the internal threads) blocks

• Can’t overlap I/O with computation

Kernel-provided Threads

Threads are implemented by the kernel

• TCBs (thread control block) are stored in the kernel

• A subset of information in a traditional PCB (process control block)

• The subset related to execution context

• TCBs have a PCB associated with them

• Resources associated with the group of threads (the process)

• Thread management calls are implemented as system calls

• E.g. create, wait, exit

Cons

• Thread creation and destruction, and blocking and unblocking threads requires kernel entry and exit.

• More expensive than user-level equivalent

Pros

• Preemptive multithreading

• Parallelism

• Can overlap blocking I/O with computation

Can take advantage of a multiprocessor

Context Switch

Context Switch is what the lowest level of OS does when an interrupt occurs.

A context switch can refer to

• A switch between threads

• Involving saving and restoring of state associated with a thread

• A switch between processes

• Involving the above, plus extra state associated with a process.

• E.g. memory maps

Context Switch Occurrence

A switch between process/threads can happen any time the OS is invoked

• On a system call

• Mandatory if system call blocks or on exit();

• On an exception

• Mandatory if offender is killed

• On an interrupt

• Triggering a dispatch is the main purpose of the timer interrupt

A thread switch can happen between any two instructions

Note instructions do not equal program statements

Context Switch Addition

• A context switch must be transparent for processes/threads

• When dispatched again, a process/thread should not notice that something else was running in the meantime (except for elapsed time)

• The OS must save all state that affects the thread

• This state is called the process/thread context

• Switching between processes/threads consequently results in a context switch.

Lec06 Syscalls

Syscall Definition

Can be viewed as special function calls

• Provides for a controlled entry into the kernel

• While in kernel, they perform a privileged operation

• Returns to original caller with the result

The system call interface represents the abstract machine provided by the operating system.

CPU Computation Model

The fetch-execute cycle

• Load memory contents from address in program counter (PC)

• The instruction

• Execute the instruction

• Increment PC

• Repeat

Privileged-mode operation

To protect operating system execution, two or more CPU modes of operation exist

• Privileged mode (system-, kernel-mode)

All instructions and registers are available

• User-mode

• Uses ‘safe’ subset of the instruction set

• Only affects the state of the application itself

• They cannot be used to uncontrollably interfere with OS

Only ‘safe’ registers are accessible

The accessibility of addresses within an address space changes depending on operating mode

• To protect kernel code and data

• Note: The exact memory ranges are usually configurable, and vary between CPU architectures and/or operating systems.

System call mechanism securely transfers from user execution to kernel execution and back.

System call mechanism overview

• Processor mode

• Switched from user-mode to kernel-mode

• Switched back when returning to user mode

• Stack Pointer (SP)

• User-level SP is saved and a kernel SP is initialised

• User-level SP restored when returning to user-mode

• Program Counter (PC)

• User-level PC is saved and PC set to kernel entry point

• User-level PC restored when returning to user-level

Kernel entry via the designated entry point must be strictly enforced

•Registers

• Set at user-level to indicate system call type and its arguments

• A convention between applications and the kernel

• Some registers are preserved at user-level or kernel-level in order to restart user-level execution

• Depends on language calling convention etc.

• Result of system call placed in registers when returning to user-level

• Another convention

We need system calls because function calls do not:

  1. Change from user to kernel mode/back again
  2. Restrict possible entry points to secure locations

Coprocessor 0 (CP0)

The processor control registers are located in CP0

  1. Exception/Interrupt management registers
  2. Translation management registers

Exception management

C0_status

We only focus on bits 0-15

c0_cause

Bits 2-6 hold ExcCode -> the code number of the exception taken

C0_epc

The Exception Program Counter

• Points to address of where to restart execution after handling the exception or interrupt

Hardware Exception handling

Basic situation: Assume an interrupt occurred as the previous instruction completed & We are in user mode with interrupts enabled.

Steps taken by the hardware:

  1. The instruction address at which to restart after the interrupt is transferred to EPC
  2. Interrupts are disabled and the previous state shifted along
    1. What was stored in IEc/KUc is now in IEp/KUp
    2. Kernel mode is set, and the previous state shifted along
      1. What was stored in IEp/KUp (before 2a executed) is now in IEo/KUo
  3. The code for the exception is placed in Cause -> ExcCode in c0_cause is now 0 (0 is the code for an interrupt)
  4. The address of the general exception vector is placed in PC
  5. The CPU is now running in kernel mode at the address in PC, with interrupts disabled
    1. All the information required to
      1. find out what caused the exception, and
      2. restart after exception handling
    is in the coprocessor registers

Returning from the exception (ignoring how the OS handles it):

  1. Load the contents of EPC
  2. Store the EPC back in the PC
  3. In the branch delay slot (as we jump back to the PC/user mode), execute a restore-from-exception instruction -> moves KU/IE back
  4. Now back in the same state we were in when the exception happened

MIPS System Calls

• System calls are invoked via a syscall instruction.

• The syscall instruction causes an exception and transfers control to the general exception handler

• A convention (an agreement between the kernel and

applications) is required as to how user-level software

indicates

• Which system call is required

• Where its arguments are

• Where the result should go

OS/161 Systems Calls

OS/161 uses the following conventions:

• Arguments are passed and returned via the normal C function calling convention

• Additionally

• Reg v0 contains the system call number

On return, reg a3 contains

• 0: if success, v0 contains the successful result

• not 0: if failure, v0 contains the error code; the user-level stub then

• stores v0 into errno

• returns -1 in v0

Summary of System Call in User Mode

• From the caller’s perspective, the read() system call behaves like a normal function call

• It preserves the calling convention of the language

•However, the actual function implements its own convention by agreement with the kernel

• Our OS/161 example assumes the kernel preserves appropriate registers(s0-s8, sp, gp, ra).

•Most languages have similar libraries that interface with the operating system.

System Calls – Kernel Side

• Things left to do

• Change to kernel stack

• Preserve registers by saving to memory (on the kernel stack)

• Leave saved registers somewhere accessible to

• Read arguments

• Store return values

• Do the “read()”

• Restore registers

• Switch back to user stack

• Return to application

Lec07 Computer Hardware Review

Memory Hierarchy

Caching

•Given two-levels of data storage: small and fast, versus large and slow,

• Can speed access to slower storage by using intermediate-speed storage as a cache.

CPU cache

• CPU cache is fast memory placed between the CPU and main memory

• 1 to a few cycles access time compared to RAM access time of tens – hundreds of cycles

Holds recently used data or instructions to save memory accesses.

Matches slow RAM access time to CPU speed if high hit rate

• Is hardware maintained and (mostly) transparent to software

• Sizes range from few kB to tens of MB.

• Usually a hierarchy of caches (2–5 levels), on- and off-chip

Performance

The performance depends on the hit rate in the first level.

Effective Access Time
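
The notes leave this section empty; the standard form (a textbook formula, not copied from the slides) weights the two levels by the hit rate H:

    T_eff = H × T_1 + (1 − H) × (T_1 + T_2)

For example, with cache access time T_1 = 1 ns, memory access time T_2 = 100 ns and H = 0.95:

    T_eff = 0.95 × 1 + 0.05 × (1 + 100) = 6 ns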

Avoid Waiting for Disk Access

• Keep a subset of the disk’s data in main memory

  1. OS uses main memory as a cache of disk contents

Avoid Waiting for Internet Access

• Keep a subset of the Internet’s data on disk

  1. Application uses disk as a cache of the internet

Lec08 File Management

File Names

File system must provide a convenient naming scheme

• Textual Names

• May have restrictions

• Only certain characters

• E.g. no ‘/’ characters

• Limited length

• Only certain format

• E.g DOS, 8 + 3

• Case (in)sensitive

• Names may obey conventions (.c files for C files)

• Interpreted by tools (e.g. UNIX)

• Interpreted by operating system (e.g. Windows “con:”)

File Structure

• Sequence of Bytes

• OS considers a file to be unstructured

• Applications can impose their own structure

• Used by UNIX, Windows, most modern OSes

P9 lec08

File Types

• Regular files

•Directories

•Device Files

–May be divided into

•Character Devices – stream of bytes

•Block Devices

•Some systems distinguish between regular file types

–ASCII text files, binary files

File Access Types (Patterns)

•Sequential access

–read all bytes/records from the beginning

–cannot jump around, could rewind or back up

–convenient when the medium was magnetic tape

•Random access

–bytes/records read in any order

–essential for database systems

–a read can be …

•move the file pointer (seek), then read

–lseek(location,…); read(…)

•each read specifies the file pointer

–read(location,…)

Typical File Operations

  1. Create
  2. Delete
  3. Open
  4. Close
  5. Read
  6. Write
  7. Append
  8. Seek
  9. Get attributes
  10. Set Attribute
  11. Rename

Criteria for File Organization

Things to consider when designing file layout

•Rapid access

–Needed when accessing a single record

–Not needed for batch mode

•read from start to finish

•Ease of update

–File on CD-ROM will not be updated, so this is not a concern

•Economy of storage

–Should be minimum redundancy in the data

–Redundancy can be used to speed access such as an index

File Directories

•Provide mapping between file names and the files themselves

•Contain information about files

–Attributes

–Location

–Ownership

Directory itself is a file owned by the operating system

Hierarchical (Tree Structured) Directory

•Files can be located by following a path from the root, or master, directory down various branches

–This is the absolute pathname for the file

Can have several files with the same file name, as long as they have unique path names (file names within the same directory must be unique)

Relative and Absolute Pathnames

•Absolute pathname

–A path specified from the root of the file system to the file

A Relative pathname

–A pathname specified from the cwd (current working directory)

Typical Directory Operations

  1. Create
  2. Delete
  3. Opendir
  4. Closedir
  5. Readdir
  6. Rename
  7. Link
  8. Unlink

Nice Properties of UNIX naming

•Simple, regular format

–Names referring to different servers, objects, etc., have the same syntax.

•Regular tools can be used where specialised tools would otherwise be needed.

•Location independent

–Objects can be distributed or migrated, and continue with the same names

Access Rights

In multiuser system, we need to consider the issue of the access rights.

Different kinds of Access Rights:

  1. None
    1. User may not know the existence of the file
    2. User is not allowed to read the directory that includes the file
  2. Knowledge
    1. User can only determine that the file exists and who its owner is
  3. Execution
    1. The user can load and execute a program but cannot copy it
  4. Reading
    1. The user can read the file for any purpose, including copying and execution
  5. Appending
    1. The user can add data to the file but cannot modify or delete any of the file’s contents
  6. Updating
    1. The user can modify, delete, and add to the file’s data. This includes creating the file, rewriting it, and removing all or part of the data
  7. Changing protection
    1. User can change access rights granted to other users
  8. Deletion
    1. User can delete the file
  9. Owners
    1. Has all rights previously listed
    2. May grant rights to others using the following classes of users
      1. •Specific user
      2. •User groups
      3. •All for public files

Simultaneous Access

In multiuser system, we need to consider the issue of the simultaneous access.

•Most OSes provide mechanisms for users to manage concurrent access to files

–Example: the flock() and lockf() system calls

•Typically

–User may lock entire file when it is to be updated

–User may lock the individual records (i.e. ranges) during the update

Mutual exclusion and deadlock are issues for shared access

Lec09 File System Internals

UNIX Storage Stack

From lower to Higher:

  1. Hard disk platters:
    1. Tracks
    2. Sectors
  2. Disk controller
    1. Hides disk geometry, bad sectors
    2. Exposes linear sequence of blocks(disk’s interface)
  3. Device driver
    1. Hides device-specific protocol
    2. Exposes block-device interface
  4. Disk scheduler/buffer cache -> Optimisations
    1. Keep recently accessed disk blocks in memory
    2. Schedule disk accesses from multiple processes for performance and fairness
  5. File System
    1. Hides physical location of data on the disk
    2. Exposes:
      1. Directory hierarchy
      2. Symbolic file names
      3. Random-access files
      4. Protection
  6. Virtual File System (VFS)
    1. Unified interface to multiple File systems
  7. Open File Table (OF table)/File Descriptor Table (FD table)
    1. Keep track of files opened by user-level processes
    2. Matches syscall interface to VFS interface

Difference between some FS

Different physical nature of storage devices

– Ext3 is optimised for magnetic disks

– JFFS2 is optimised for flash memory devices

– ISO9660 is optimised for CDROM

Different storage capacities

– FAT16 does not support drives >2GB

– FAT32 becomes inefficient on drives >32GB

– ZFS, Btrfs is designed to scale to multi-TB disk arrays

Different CPU and memory requirements

– FAT16 is not suitable for modern PCs but is a good fit for many embedded devices

Proprietary standards

– NTFS may be a nice FS, but its specification is closed

Property of hard disk

– Seek time

• ~15ms worst case

– Rotational delay

• 8ms worst case for 7200rpm drive

– For comparison, disk-to-buffer transfer speed of a modern drive is ~10µs per 4K block

Conclusion: keep blocks that are likely to be accessed together close to each other

Implementing a file system

Requirements

The FS must map symbolic file names into a collection of block addresses.

The FS must keep track of

– which blocks belong to which files.

– in what order the blocks form the file

– which blocks are free for allocation

Given a logical region of a file, the FS must track the corresponding block(s) on disk.

File Allocation Methods

Contiguous Allocation

✔ Easy bookkeeping (need to keep track of the starting block and length of the file)

✔ Increases performance for sequential operations

✗ Need the maximum size for the file at the time of creation

✗ As files are deleted, free space becomes divided into many small chunks (external fragmentation)

Example: ISO 9660 (CDROM)

Dynamic Allocation Strategies

– Disk space allocated in portions as needed

– Allocation occurs in fixed-size blocks

✔ No external fragmentation

✔ Does not require pre-allocating disk space

✗ Partially filled blocks (internal fragmentation)

✗ File blocks are scattered across the disk

✗ Complex metadata management (maintain the collection of blocks for each file)

Examples come after:

  1. Linked list allocation
  2. File Allocation Table (FAT)
  3. Inode-based FS structure

External and Internal fragmentation

External fragmentation

– The space wasted external to the allocated memory regions

– Memory space exists to satisfy a request but it is unusable as it is not contiguous

Internal fragmentation

– The space wasted internal to the allocated memory regions

Allocated memory may be slightly larger than requested memory; this size difference is wasted memory internal to a partition

Conclusion

Internal fragmentation is inside the allocated memory blocks

External fragmentation is between two allocated memory blocks

Dynamic allocation : Linked list allocation

Each block contains the block number of the next block in the chain. Free blocks are also linked in a chain.

✔ Only a single metadata entry per file

✔ Best for sequentially accessed files

✗ Poor for random access

✗ Blocks end up scattered across the disk due to the free list eventually being randomised

Dynamic Allocation: File allocation Table (FAT)

• Keep a map of the entire FS in a separate table

– A table entry contains the number of the next block of the file

– The last block in a file and empty blocks are marked using reserved values

• The table is stored on the disk and is replicated in memory

Random access is fast (following the in-memory list)

Issues:

Requires a lot of memory for large disks
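
A sketch (with illustrative constants, not the real FAT on-disk format) of following the in-memory table to find the disk block that holds a given logical block of a file:

    #define FAT_EOF (-1)          /* reserved value marking the last block */

    /* fat[b] holds the number of the block that follows block b in its file.
       Returns the disk block holding logical block n of the file whose
       first block is `start`, or -1 if the chain is too short.            */
    int fat_lookup(const int *fat, int start, int n)
    {
        int block = start;
        for (int i = 0; i < n; i++) {
            if (block == FAT_EOF)
                return -1;
            block = fat[block];    /* follow the chain one link */
        }
        return block;
    }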

File allocation table disk layout

Dynamic Allocation: inode-based FS structure

Idea: separate table (index-node or i-node) for each file.

– Only keep table for open files in memory

Fast random access

The most popular FS structure today

Issues:

  1. i-nodes occupy one or several disk areas
  2. i-nodes are allocated dynamically, hence free-space management is required for i-nodes
    1. Use fixed-size i-nodes to simplify dynamic allocation
    2. Reserve the last i-node entry for a pointer (a block number) to an extension i-node
  3. Free-space management
    1. Approach 1: a linked list of free blocks, stored in the free blocks themselves on disk
    2. Approach 2: keep bitmaps of free blocks and free i-nodes on disk

Free block List

List of all unallocated blocks

Background jobs can re-order list for better contiguity

Store in free blocks themselves

Does not reduce disk capacity

Only one block of pointers needs to be kept in main memory

Bit tables

Individual bits in a bit vector flags used/free blocks

16GB disk with 512-byte blocks –> 4MB table

May be too large to hold in main memory

Expensive to search

– Optimisations possible, e.g. a two level table

Concentrating (de)allocations in a portion of the bitmap has desirable effect of concentrating access

Simple to find contiguous free space

Implementing directories

Directories are stored like normal files

– directory entries are contained inside data blocks

• The FS assigns special meaning to the content of these files

– a directory file is a list of directory entries

– a directory entry contains file name, attributes, and the file i-node number

• maps human-oriented file name to a system-oriented name

Fixed-size directory entries

– Either too small

• Example: DOS 8+3 characters

– Or waste too much space

• Example: 255 characters per file name

Variable-size directory entries

– Freeing variable length entries can create external fragmentation in directory blocks

• Can compact when block is in RAM

Searching Directory methods

  1. Linear scan
    1. Implement a directory cache in software to speed-up search
  2. Hash lookup
  3. B-tree

Storing file attributes

  1. Disk addresses and attributes in directory entry -> FAT
  2. Directory in which each entry just refers to an i-node -> UNIX

File system block size

File systems deal with 2 types of blocks

– Disk blocks or sectors (usually 512 bytes)

– File system blocks 512 * 2^N bytes

The question is what N should be.

Smaller blocks waste less disk space (less internal fragmentation)

• Sequential Access

– The larger the block size, the fewer I/O operations required

• Random Access

– The larger the block size, the more unrelated data loaded.

– Spatial locality of access improves the situation

• Choosing an appropriate block size is a compromise

Lec10 UNIX File Management (continue)

Virtual file system (VFS)

Traversing the directory hierarchy may require VFS to issue requests to several underlying file systems.

• Provides single system call interface for many file systems

– E.g., UFS, Ext2, XFS, DOS, ISO9660,…

• Transparent handling of network file systems

– E.g., NFS, AFS, CODA

• File-based interface to arbitrary device drivers (/dev)

• File-based interface to kernel data structures (/proc)

Provides an indirection layer for system calls

– File operation table set up at file open time

– Points to actual handling code for particular type

– Further file operations redirected to those functions

Data Types in VFS Interface

VFS

  1. Represent all file system types
  2. Contains pointers to functions to manipulate each file system as a whole
    1. Form a standard interface to the file system

Vnode

  1. Represents a file (inode) in the underlying filesystem
  2. Points to the real inode
  3. Contains pointers to functions to manipulate files/inodes, like open/close/read

File descriptors

Attributes

  1. Each open file has a file descriptor
  2. Read/Write/Lseek/… use them to specify which file to operate on.

State associated with a file descriptor

  1. File pointer
    1. Determine where in the file the next read or write is performed
  2. Mode
    1. e.g. whether the file was opened with read-only permission

Use a single fd table with no OF (open file) table

How to achieve:

Use vnode numbers as file descriptors and add a file pointer to the vnode.

Problem:

Cannot handle concurrent opens of the same file (there is only a single file pointer per vnode).

An alternative: a single global open file array

  1. We only have a single global open file array.
  2. Entries contain a file pointer and a pointer to a vnode.

Issues:

File descriptor 1 is stdout -> the console for some processes, but it may be a file for other processes

Entry 1 should be different per process!

Per-process File descriptor array

Each process has its own open file array

Issue

  1. Fork
    1. Fork defines that the child shares the file pointer with the parent
  2. Dup2
    1. Also defines the file descriptors share the file pointer
  3. With per-process table, we can only have independent file pointers, even when accessing the same file

Per-Process fd table with global open file table

Per-process file descriptor array contains pointers to open file table entry

The open file table array contains entries with a file pointer and a pointer to a vnode.

This model provides

  1. Shared file pointers if required
  2. Independent file pointers if required

This model is used by Linux and most other UNIX operating systems.
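
A rough sketch, with invented struct and field names, of the two-level arrangement: each process's fd array points into a system-wide open file table, whose entries carry the file pointer and a reference to the vnode.

    #include <sys/types.h>

    struct vnode;                          /* per-file object in the VFS        */

    struct openfile {                      /* entry in the global OF table      */
        struct vnode *vn;                  /* underlying file                   */
        off_t         offset;              /* the (possibly shared) file ptr    */
        int           flags;               /* open mode (read-only, ...)        */
        int           refcount;            /* how many fds reference this entry */
    };

    #define MAX_FDS 128

    struct fd_table {                      /* one per process                   */
        struct openfile *fd[MAX_FDS];      /* fd number -> OF table entry       */
    };

    /* fork() and dup2() copy fd pointers, so the descriptors share the same
       openfile entry and hence the same file pointer; two independent open()s
       of the same file create two openfile entries (independent offsets) that
       both reference the same vnode. */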

Buffer

–Temporary storage used when transferring data between two entities

•Especially when the entities work at different rates

•Or when the unit of transfer is incompatible

•Example: between application program and disk

Buffer disk blocks

•Allow applications to work with arbitrarily sized region of a file

–However, apps can still optimise for a particular block size

•Writes can return immediately after copying to kernel buffer

Avoids waiting until write to disk is complete

–Write is scheduled in the background

•Can implement read-ahead by pre-loading next block on disk into kernel buffer

–Avoids having to wait until next read is issued

Cache

Fast storage used to temporarily hold data to speed up repeated access to the data -> such as main memory can cache disk blocks

•On access

–Before loading block from disk, check if it is in cache first

Avoids disk accesses

•Can optimise for repeated access for single or several processes

•Data is read into the buffer; an extra independent cache copy would be wasteful

After use, block should be cached

•Future access may hit cached copy

•Cache utilises unused kernel memory space;

–may have to shrink, depending on memory demand

Cache replacement

When the buffer cache is full and we need to read another block into memory, we must choose an existing entry to replace

Policies:

  1. FIFO (first in first out)
  2. Least recently used

Etc.

File system consistency

Generally, cached disk blocks are prioritised in terms of how critical they are to file system consistency.

Directory blocks, inode blocks if lost can corrupt entire filesystem

•These blocks are usually scheduled for immediate write to disk

Data blocks if lost corrupt only the file that they are associated with

•These blocks are only scheduled for write back to disk periodically

•In UNIX, flushd (flush daemon) flushes all modified blocks to disk every 30 seconds

Possible Solution

Write-through cache

  1. All modified blocks are written immediately to disk

But this will generate more disk traffic

Used historically in situations where:

  1. Floppies were removed from drives
  2. Users were constantly resetting (or crashing) their machines

This method is still used for USB storage.

Lec11 ext2 FS

Descriptions:

• Second Extended Filesystem

– The main Linux FS before ext3

– Evolved from Minix filesystem (via “Extended Filesystem”)

• Features

– Block size (1024, 2048, and 4096) configured at FS creation

inode-based FS

– Performance optimisations to improve locality (from BSD FFS)

• Main Problem: unclean unmount -> e2fsck

– Ext3fs keeps a journal of (meta-data) updates

– Journal is a file where updates are logged

– Compatible with ext2fs

i-node

• Each file is represented by an inode on disk

• Inode contains the fundamental file metadata

– Access rights, owner, accounting info

– (partial) block index table of a file

Each inode has a unique number

– System oriented name

• Directories map file names to inode numbers

– Map human-oriented to system-oriented names

Inode contents

  1. Mode
    1. Type
      1. Regular file or directory
    2. Access mode
      1. Rwxrwxrwx
  2. Uid
    1. User ID
  3. Gid
    1. Group ID
  4. Atime
    1. Time of last access
  5. Ctime
    1. Time when the file was created
  6. Mtime
    1. Time when file was modified
  7. Size
    1. Offset of the highest byte written
  8. Block count
    1. Number of disk blocks used by the file
    2. Note that the number of blocks can be much less than expected given the file size

IMPORTANT:

Files can be sparsely populated, e.g. by writing a byte, lseek()ing far forward, and writing another byte; the blocks in between need not be allocated.

So the size is not the same as the block count.

Direct blocks
    1. Block numbers of the first 12 blocks in the file
    2. Most files are small
      1. We can find the blocks of the file directly from the inode.

For larger files, the block numbers of blocks beyond the first 12 are stored in single/double/triple indirect blocks.

Single indirect
    1. Requires two disk accesses to read
      1. One for the indirect block, one for the target block.
    2. Max file size
      1. Assume 1K byte block size, 4-byte block numbers
        1. 12 * 1K + 1K/4 * 1K = 268 KiB

So for the large majority of files (<268 KiB), given the inode, only one or two further accesses are required to read any block in the file.

Double indirect
    1. Block number of a block containing block numbers of blocks containing block numbers -> the second level also holds block numbers of blocks; the third level holds the data block numbers
Triple indirect
    1. Block number of a block containing block numbers of blocks containing block numbers of blocks containing block numbers (see the sketch below)
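
A small sketch of how a file's logical block number falls into the direct/indirect levels, using the same assumptions as the calculations below (12 direct pointers, 1 KiB blocks, 4-byte block numbers, so 256 entries per indirect block):

    #include <stdio.h>

    #define NDIRECT   12
    #define PER_BLOCK 256      /* 1 KiB block / 4-byte block numbers */

    /* Disk accesses needed to reach logical block `blk` of a file,
       assuming the inode itself is already in memory. */
    int accesses_for_block(long blk)
    {
        if (blk < NDIRECT)
            return 1;                                      /* direct          */
        if (blk < NDIRECT + PER_BLOCK)
            return 2;                                      /* single indirect */
        if (blk < NDIRECT + PER_BLOCK + (long)PER_BLOCK * PER_BLOCK)
            return 3;                                      /* double indirect */
        return 4;                                          /* triple indirect */
    }

    int main(void)
    {
        printf("%d %d %d\n",
               accesses_for_block(5),       /* 1: one of the 12 direct blocks */
               accesses_for_block(100),     /* 2: single indirect             */
               accesses_for_block(5000));   /* 3: double indirect             */
        return 0;
    }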

Max File Size

Assume 4 bytes block numbers and 1K blocks

• The number of addressable blocks

– Direct Blocks = 12

– Single Indirect Blocks = 256

– Double Indirect Blocks = 256 * 256 = 65536

– Triple Indirect Blocks = 256 * 256 * 256 = 16777216

• Max File Size

12 + 256 + 65536 + 16777216 = 16843020 blocks ≈ 16 GB

Example:

• Assume 4K blocks, 4 byte block numbers, 12 direct blocks

what if we add

– lseek(fd, 5242880, SEEK_SET) /* 5 megabytes */

– write(fd, “x”, 1)

Address = 5242880 ==>

Block number = 5242880/4096 =1280

Double indirect offset (20-bit)

= 1280 – 1036

= 244

Top 10 bits = 0 -> 0th entry of the first-level (double indirect) block

Lower 10 bits = 244 -> 244th entry of the second-level block

Best and worse case access patterns

Assume inode already in memory

To read 1 byte

– Best:

• 1 access via direct block

– Worst:

• 4 accesses via the triple indirect block

To write 1 byte

– Best:

• 1 write via direct block (with no previous content)

– Worst:

• 4 reads (to get previous contents of block via triple indirect) + 1 write (to write modified block back)

Worst Case Access Patterns with Unallocated Indirect Blocks

• Worst to write 1 byte

– 4 writes (3 indirect blocks; 1 data)

  • All indirect are new

– 1 read, 4 writes (read-write 1 indirect, write 2; write 1 data)

  • First indirect has something

– 2 reads, 3 writes (read 1 indirect, read-write 1 indirect, write 1; write 1 data)

  • First and second indirect have something

– 3 reads, 2 writes (read 2, read-write 1; write 1 data)

  • All three indirect have something

• Worst to read 1 byte

– If reading writes a zero-filled block on disk

– Worst case is same as write 1 byte

– If not, worst-case depends on how deep is the current indirect block tree.

Inode Summary

The inode (and indirect blocks) contains the on-disk metadata associated with a file

– Contains mode, owner, and other bookkeeping

– Efficient random and sequential access via indexed allocation

– Small files (the majority of files) require only a single access

– Larger files require progressively more disk accesses for random access

• Sequential access is still efficient

– Can support really large files via increasing levels of indirection

System V Disk Layout (s5fs)

– Boot Block

• contain code to bootstrap the OS

– Super Block

• Contains attributes of the file system itself

• e.g. size, number of inodes, start block of inode array, start of data block area, free inode list, free data block list

– Inode Array

– Data blocks

Problems

• Inodes at the start of the disk; data blocks at the end

– Long seek times

• Must read inode before reading data blocks

• Only one superblock

– Corrupt the superblock and entire file system is lost

• Block allocation was suboptimal

– Consecutive free block list created at FS format time

• Allocation and de-allocation eventually randomises the list resulting in random allocation

• Inode free list also randomised over time

– Directory listing resulted in random inode access pattern

Berkeley Fast Filesystem (FFS)

•Historically followed s5fs

–Addressed many limitations with s5fs

–ext2fs mostly similar

Layout of an Ext2 FS

•Partition:

–Reserved boot block,

–Collection of equally sized block groups

–All block groups have the same structure

Layout of a block group

Replicated super block

–For e2fsck

Replicated Group descriptors

•Bitmaps identify used inodes/blocks

•All block groups have the same number of data blocks

•Advantages of this structure:

–Replication simplifies recovery

–Proximity of inode tables and data blocks (reduces seek time)

Superblocks

•Size of the file system, block size and similar parameters

•Overall free inode and block counters

•Data indicating whether file system check is needed:

–Uncleanly unmounted

–Inconsistency

–Certain number of mounts since last check

–Certain time expired since last check

•Replicated to provide redundancy to aid recoverability

Group Descriptors

•Location of the bitmaps

•Counter for free blocks and inodes in this group

•Number of directories in the group

•Replicated to provide redundancy to aid recoverability

This information helps the file system locate and manage the resources of each block group.

Performance considerations

EXT2 optimisations

– Block groups cluster related inodes and data blocks

–Pre-allocation of blocks on write (up to 8 blocks)

•8 bits in bit tables

•Better contiguity when there are concurrent writes

Aim to store files within a directory in the same group

Ext2 Directories

•Directories are files of a special type

•Consider it a file of special format, managed by the kernel, that uses most of the same machinery to implement it

–Inodes, etc…

•Directories translate names to inode numbers

•Directory entries are of variable length

•Entries can be deleted in place, either by

•setting the entry's inode number to 0, or

•adding the deleted entry's length to the previous entry

Example directory entries: “f1” = inode 7, “file2” = inode 43, “f3” = inode 85

Inodes can have more than one name -> hard link

Reference Count in inode

Keep a reference count in the inode, so that:

When we delete a file by name, i.e. remove the directory entry (link), the file system knows when to delete the underlying inode

•Adding a name (directory entry) increments the count

•Removing a name decrements the count

•If the reference count == 0, then we have no names for the inode (it is unreachable), we can delete the inode (underlying file or directory)

• A symbolic link is a file that contains a reference to another file or directory

Has its own inode and data block, which contains a path to the target file

– Marked by a special file attribute

– Transparent for some operations

Can point across FS boundaries -> a symbolic link only records the path to the target file, not the target itself.

File system reliability

• Disk writes are buffered in RAM

– OS crash or power outage ==> lost data

– Commit writes to disk periodically (e.g., every 30 sec)

– Use the sync command to force a FS flush

• FS operations are non-atomic

– Incomplete transaction can leave the FS in an inconsistent state

Other examples

• e2fsck

– Scans the disk after an unclean shutdown and attempts to restore FS invariants

• Journaling file systems

– Keep a journal of FS updates

– Before performing an atomic update sequence,

– write it to the journal

– Replay the last journal entries upon an unclean shutdown

– Example: ext3fs

Lec12 ext3 FS

Design goals

Add journaling capability to the ext2 FS

Backward and forward compatibility with ext2

• Existing ext2 partitions can be mounted as ext3

– Leverage the proven ext2 performance

– Reuse most of the ext2 code base

– Reuse ext2 tools, including e2fsck

The ext3 journal

Option 1: Journal FS data structure updates

Example:

Start transaction

– Delete dir entry

– Delete i-node

– Release blocks 32, 17, 60

End transaction

Conclusion

✔Efficient use of journal space; hence faster journaling

✘ Individual updates are applied separately

✘ The journaling layer must understand FS semantics

Option 2: Journal disk block updates

Example:

Start transaction

– Update block #n1 (contains the dir entry)

– Update block #n2 (i-node allocation bitmap)

– Update block #n3 (data block allocation bitmap)

End transaction

Conclusion

✗ Even a small update adds a whole block to the journal

✔ Multiple updates to the same block can be aggregated into a single update

✔ The journaling layer is FS independent (easier to implement)

Ext3 use this method

Journaling Block Device (JBD)

JBD interface

– Start a new transaction

– Update a disk block as part of a transaction

– Complete a transaction

• Completed transactions are buffered in RAM

– Commit: write transaction data to the journal (persistent storage)

• Multiple FS transactions are committed in one go

– Checkpoint: flush the journal to the disk

• Used when the journal is full or the FS is being unmounted

Journaling modes

Ext3 supports two journaling modes

– Metadata+data

• Enforces atomicity of all FS operations

– Metadata journaling

• Metadata is journalled

• Data blocks are written directly to the disk

• Improves performance

• Enforces file system integrity

• Does not enforce atomicity of writes

– New file content can be stale blocks

JBD conclusion

JBD

• JBD can keep the journal on a block device or in a file

– Enables compatibility with ext2 (the journal is just a normal file)

• JBD is independent of ext3-specific data structures

– Separation of concerns

• The FS maintains on-disk data and metadata

• JBD takes care of journaling

– Code reuse

• JBD can be used by any other FS that requires journaling

Lec13 memory Management

Fixed Partition Summary

• Simple

• Easy to implement

•Can result in poor memory utilisation

• Due to internal fragmentation

•Used on IBM System 360 operating system (OS/MFT)

• Announced 6 April, 1964

• Still applicable for simple embedded systems

• Static workload known in advance

Dynamic Partitioning

Allocation

•Also applicable to malloc()-like in-application allocators

•Given a region of memory, basic requirements are:

• Quickly locate a free partition satisfying the request

• Minimise CPU time search

• Minimise external fragmentation

• Minimise memory overhead of bookkeeping

• Efficiently support merging two adjacent free partitions into a larger partition

Classic Approach

Represent available memory as a linked list of available “holes” (free memory ranges).

• Base, size

• Kept in order of increasing address

• Simplifies merging of adjacent holes into larger holes.

• List nodes are stored in the “holes” themselves

Placement

First-fit algorithm

• Scan the list for the first entry that fits

• If greater in size, break it into an allocated and free part

• Intent: Minimise amount of searching performed

• Aims to find a match quickly

• Biases allocation to one end of memory

• Tends to preserve larger blocks at the end of memory
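
A sketch of first-fit over the address-ordered free list, with invented node fields: the first hole large enough is split into an allocated part and a (possibly empty) remaining hole. Freeing the unlinked node and coalescing on free are omitted.

    #include <stddef.h>

    struct hole {
        size_t       base;     /* start address of the free range */
        size_t       size;     /* length of the free range        */
        struct hole *next;     /* list kept in address order      */
    };

    /* Returns the allocated base address, or (size_t)-1 on failure. */
    size_t first_fit(struct hole **list, size_t request)
    {
        for (struct hole **pp = list; *pp != NULL; pp = &(*pp)->next) {
            struct hole *h = *pp;
            if (h->size >= request) {
                size_t base = h->base;
                h->base += request;        /* shrink the hole ...       */
                h->size -= request;
                if (h->size == 0)          /* ... or unlink it entirely */
                    *pp = h->next;
                return base;
            }
        }
        return (size_t)-1;                 /* no hole large enough      */
    }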

Next-fit

Like first-fit, except it begins its search from the point in list where the last request succeeded instead of at the beginning.

• (Flawed) Intuition: spread allocation more uniformly over entire memory to avoid skipping over small holes at start of memory

Performs worse than first-fit as it breaks up the large free space at the end of memory

Best-fit

• Chooses the block that is closest in size to the request

Performs worse than first-fit

• Has to search complete list

• does more work than first-fit

• Since smallest block is chosen for a process, the smallest amount of external fragmentation is left

• Create lots of unusable holes

Worst-fit

• Chooses the block that is largest in size (worst-fit)

• (whimsical) idea is to leave a usable fragment left over

• Poor performer

Has to do more work (like best fit) to search complete list

• Does not result in significantly less fragmentation

Summary

First-fit generally better than the others and easiest to implement

They are simple solutions to a still-existing OS or application service/function – memory allocation.

Largely have been superseded by more complex and specific allocation strategies

Compaction

We can reduce external fragmentation by compaction

• Shuffle memory contents to place all free memory together in one large block.

• Only if we can relocate running programs?

• Pointers?

• Generally requires hardware support

Issues

• Relocation

• How does a process run in different locations in memory?

• Protection

• How do we prevent processes interfering with each other

When are memory addresses bound?

• Compile/link time

• Compiler/Linker binds the addresses

• Must know “run” location at compile time

• Recompile if location changes

• Load time

• Compiler generates relocatable code

• Loader binds the addresses at load time

• Run time

• Logical compile-time addresses translated to physical addresses by special hardware.

Hardware Support for Runtime Binding and Protection

• For process to run using logical addresses

• Process expects to access addresses from zero to some limit of memory size

• Need to add an appropriate offset to its logical addresses

• Achieve relocation

• Protect memory “lower” than the process

• Must limit the maximum logical address the process can generate

• Protect memory “higher” than the process

Base and Limit Registers

• Also called

• Base and bound registers

• Relocation and limit registers

• Base and limit registers

Restrict and relocate the currently active process

• Base and limit registers must be changed at

• Load time

• Relocation (compaction time)

• On a context switch
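A minimal sketch of what the hardware does with these two registers on every memory access, assuming the limit register holds the size of the process's logical address space; the names are illustrative, not taken from any real MMU.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative relocation/protection registers, reloaded by the OS at
 * load time, after compaction, and on every context switch. */
struct relocation_regs {
    uint32_t base;   /* physical address where the process is loaded */
    uint32_t limit;  /* size of the process's logical address space  */
};

/* Returns true and fills *phys if the logical address is valid;
 * returning false models raising a protection fault. */
bool translate(const struct relocation_regs *r,
               uint32_t logical, uint32_t *phys)
{
    if (logical >= r->limit)     /* beyond the limit: protection fault */
        return false;
    *phys = r->base + logical;   /* relocation: add the base register  */
    return true;
}
```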

Pros

• Supports protected multi-processing (-tasking)

Cons

Physical memory allocation must still be contiguous

• The entire process must be in memory

• Do not support partial sharing of address spaces

No shared code, libraries, or data structures between processes

Swapping

• A process can be swapped temporarily out of memory to a backing store, and then brought back into memory for continued execution.

• Swapping involves transferring the whole process

• Backing store – fast disk large enough to accommodate copies of all memory images for all users; must provide direct access to these memory images.

• Can prioritize – lower-priority process is swapped out so higher-priority process can be loaded and executed.

Major part of swap time is transfer time; total transfer time is directly proportional to the amount of memory swapped.

• slow

Virtual memory-paging

• Partition physical memory into small equal sized chunks

• Called frames

• Divide each process’s virtual (logical) address space into same size chunks

• Called pages

• Virtual memory addresses consist of a page number and offset within the page

• OS maintains a page table

• contains the frame location for each page

• Used by hardware to translate each virtual address to physical address

• The relation between virtual addresses and physical memory addresses is given by page table

Process’s physical memory does not have to be contiguous
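A toy version of that translation, assuming 4 KiB pages and a simple one-level page table held as an array of entries; the structure and names are illustrative only.

```c
#include <stdint.h>

#define PAGE_SHIFT 12                       /* 4 KiB pages (assumed) */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

struct pte {
    uint32_t frame;   /* physical frame number the page maps to */
    int      valid;   /* non-zero if the page is resident       */
};

/* Split the virtual address into (page number, offset), look up the
 * frame in the page table, and rebuild the physical address. */
int v2p(const struct pte *page_table, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t page   = vaddr >> PAGE_SHIFT;      /* page number    */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* offset in page */

    if (!page_table[page].valid)
        return -1;                              /* page fault     */

    *paddr = (page_table[page].frame << PAGE_SHIFT) | offset;
    return 0;
}
```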

Paging

No external fragmentation

Small internal fragmentation (in last page)

Allows sharing by mapping several pages to the same frame

• Abstracts physical organisation

• Programmers only deal with virtual addresses

• Minimal support for logical organisation

• Each unit is one or more pages

Memory Management Unit (Translation Look-aside Buffer/TLB)

Lec14 Virtual Memory

Page-based VM

Virtual Memory

– Divided into equal sized pages

– A mapping is a translation between

• A page and a frame

• A page and invalid

– Mappings defined at runtime

• They can change

– Address space can have holes

Process does not have to be contiguous in physical memory

Typical Address Space Layout

• Stack region is at top, and can grow down

• Heap has free space to grow up

• Text is typically read-only

• Kernel is in a reserved, protected, shared region

• 0-th page typically not used. Why? -> to catch NULL pointer dereferences

• A process may be only partially resident

– Allows OS to store individual pages on disk

– Saves memory for infrequently used data & code

• What happens if we access non-resident memory? -> Page fault

Page fault

• Referencing an invalid page triggers a page fault

• An exception handled by the OS

• Broadly, two standard page fault types

– Illegal Address (protection error)

• Signal or kill the process

– Page not resident

• Get an empty frame

• Load page from disk

• Update page (translation) table (enter frame #, set valid bit, etc.)

• Restart the faulting instruction
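C-syntax pseudocode for the two fault types above; every helper (is_legal_address, find_free_frame, load_page_from_disk, kill_current_process) is a hypothetical stand-in for kernel services, not a real OS/161 or Linux function.

```c
#include <stdint.h>

struct pte { uint32_t frame; int valid; };

/* Hypothetical kernel helpers, declared only so the sketch compiles. */
extern int  is_legal_address(uint32_t vaddr);
extern void kill_current_process(void);
extern int  find_free_frame(void);               /* may evict a victim page */
extern void load_page_from_disk(uint32_t page, int frame);

void handle_page_fault(struct pte *page_table, uint32_t vaddr)
{
    if (!is_legal_address(vaddr)) {
        kill_current_process();           /* illegal address: protection error */
        return;
    }

    uint32_t page  = vaddr >> 12;         /* assuming 4 KiB pages */
    int      frame = find_free_frame();   /* get an empty frame   */

    load_page_from_disk(page, frame);     /* blocking disk I/O    */

    page_table[page].frame = (uint32_t)frame;  /* update the page table */
    page_table[page].valid = 1;                /* set the valid bit     */

    /* On return, the hardware restarts the faulting instruction. */
}
```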

Shared Pages

Private code and data

– Each process has own copy of code and data

– Code and data can appear anywhere in the address space

Shared code

– Single copy of code shared between all processes executing it

– Code must not be self modifying

– Code must appear at same address in all processes

Thrashing

• CPU utilisation tends to increase with the degree of multiprogramming (the number of processes in the system)

• Higher degrees of multiprogramming mean less memory available per process

• Some process’s working sets may no longer fit in RAM

– Implies an increasing page fault rate

• Eventually many processes have insufficient memory

– Can’t always find a runnable process

– Decreasing CPU utilisation

– System becomes I/O limited

  • This is thrashing

Recovery From Thrashing

In the presence of increasing page fault frequency and decreasing CPU utilisation

– Suspend a few processes to reduce degree of multiprogramming

– Resident pages of suspended processes will migrate to backing store

– More physical memory becomes available

• Fewer faults, faster progress for runnable processes

– Resume suspended processes later when memory pressure eases

Page Size

Increasing the page size:

• Increases internal fragmentation

– Reduces adaptability to working set size

• Decreases the number of pages

– Reduces the size of page tables

• Increases TLB coverage

– Reduces the number of TLB misses

• Increases page fault latency

– Need to read more from disk before restarting the process

• Increases swapping I/O throughput

– Small I/Os are dominated by seek/rotation delays

Optimal page size is a (work-load dependent) trade-off

Lec16 Multiprocessor Systems

Amdahl’s law

• Given a proportion P of a program that can be made parallel, and the remaining serial portion (1 - P), the speedup from using N processors is: speedup(N) = 1 / ((1 - P) + P/N)
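A small sketch that just evaluates this formula, showing how the serial fraction caps the speedup as N grows (the 0.95 workload is an invented example).

```c
#include <stdio.h>

/* Amdahl's law: speedup(N) = 1 / ((1 - P) + P / N). */
static double amdahl_speedup(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / (double)N);
}

int main(void)
{
    /* Even with 95% parallel code, the speedup approaches at most 20. */
    for (int n = 1; n <= 1024; n *= 4)
        printf("P=0.95, N=%4d -> speedup %.2f\n", n, amdahl_speedup(0.95, n));
    return 0;
}
```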

Bus-based uniform memory access multiprocessors

Simplest MP is more than one processor on a single bus connect to memory

• Access to all memory occurs at the same speed for all processors.

• Bus bandwidth becomes a bottleneck with more than just a few CPUs

Multiprocessor caches

Each processor has a cache to reduce its need for access to memory

• Hope is most accesses are to the local cache

Bus bandwidth still becomes a bottleneck with many CPUs

Cache consistency

Cache consistency is usually handled by the hardware.

• Writes to one cache propagate to, or invalidate appropriate entries on other caches

Cache transactions also consume bus bandwidth

Multi-core Processor

Summary

With only a single shared bus, scalability can be limited by the bus bandwidth of the single bus

• Caching only helps so much

• Multiprocessors can

• Increase computation power beyond that available from a single CPU

• Share resources such as disk and memory

• However

• Assumes parallelizable workload to be effective

• Assumes not I/O bound

Shared buses (bus bandwidth) limits scalability

Can be reduced via hardware design

• Can be reduced by carefully crafted software behaviour

• Good cache locality together with limited data sharing where possible

Each CPU has its own OS

• Statically allocate physical memory to each CPU

• Each CPU runs its own independent OS

• Share peripherals

• Each CPU (OS) handles its processes' system calls

• Used in early multiprocessor systems to ‘get them going’

Simpler to implement

Avoids CPU-based concurrency issues by not sharing

• Scales – no shared serial sections

• Modern analogy, virtualisation in the cloud

Issues

• Each processor has its own scheduling queue

• We can have one processor overloaded, and the rest idle

• Each processor has its own memory partition

• We can have one processor thrashing while the others have free memory

No way to move free memory from one OS to another

Symmetric Multiprocessors (SMP)

• OS kernel run on all processors

• Load and resources are balanced between all processors

• Including kernel execution

• Issue: Real concurrency in the kernel

• Need carefully applied synchronisation primitives to avoid disaster

• One alternative: a single mutex that makes the entire kernel one large critical section

Only one CPU can be in the kernel at a time

The “big lock” becomes a bottleneck when in-kernel processing exceeds what can be done on a single CPU

• Better alternative: identify largely independent parts of the kernel and make each of them their own critical section

Allows more parallelism in the kernel

• Issue: Difficult task

• Code is mostly similar to uniprocessor code

• Hard part is identifying independent parts that don’t interfere with each other

• Remember all the inter-dependencies between OS subsystems.

Multiprocessor synchronisation

Given we need synchronisation, how can we achieve it on a multiprocessor machine?

• Unlike a uniprocessor, disabling interrupts does not work.

• It does not prevent other CPUs from running in parallel

Need special hardware support

Test and set

Hardware guarantees that the instruction executes atomically on a CPU.

• Atomically: As an indivisible unit.

• The instruction cannot stop halfway through

• It does not work without some extra hardware support

A solution:

• Hardware blocks all other CPUs from accessing the bus during the TSL instruction to prevent memory accesses by any other CPU.

• TSL has mutually exclusive access to memory for duration of instruction

BUT Test-and-Set is a busy-wait synchronisation primitive -> spinlock
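A minimal spinlock sketch using the GCC/Clang __atomic_test_and_set builtin as a stand-in for the hardware TSL instruction; the lecture does not prescribe an implementation, so the type and function names here are assumptions.

```c
#include <stdbool.h>

typedef struct {
    bool locked;          /* false = free, true = held */
} spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* The builtin atomically sets the flag and returns its previous
     * value, so the loop exits only for the CPU that changed the lock
     * from free to held. */
    while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
        ;                 /* busy-wait (spin) */
}

static void spin_unlock(spinlock_t *l)
{
    __atomic_clear(&l->locked, __ATOMIC_RELEASE);
}
```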

• Issue:

• Lock contention leads to spinning on the lock

• Spinning on a lock requires blocking the bus which slows all other CPUs down

• Independent of whether other CPUs need a lock or not

• Causes bus contention

Caching does not help reduce bus contention

• Either TSL still blocks the bus

• Or TSL requires exclusive access to an entry in the local cache

• Requires invalidation of same entry in other caches, and loading entry into local cache

• Many CPUs performing TSL simply bounce a single exclusive entry between all caches using the bus

Reducing Bus contention

Read before TSL

• Spin reading the lock variable waiting for it to change

• When it does, use TSL to acquire the lock

Allows the lock to be shared read-only in all caches until it is released

• no bus traffic until actual release

• No race conditions, as acquisition is still with TSL.
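A sketch of the read-before-TSL (test-and-test-and-set) variant, under the same assumptions as the earlier spinlock sketch.

```c
#include <stdbool.h>

typedef struct {
    bool locked;
} spinlock_t;

static void spin_lock_ttas(spinlock_t *l)
{
    for (;;) {
        /* Spin on a plain read: the lock variable stays shared
         * read-only in every cache, so no bus traffic while it is held. */
        while (__atomic_load_n(&l->locked, __ATOMIC_RELAXED))
            ;
        /* Looks free: attempt the atomic acquisition. Another CPU may
         * win the race, in which case we go back to reading. */
        if (!__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
            return;
    }
}
```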

Benchmark

• Test and set performs poorly once there are enough CPUs to cause contention for the lock

• Expected

Read before Test and Set performs better

Performance is less than expected

• Still significant contention on lock when CPUs notice release and all attempt acquisition

• Critical section performance degenerates

• Critical section requires bus traffic to modify shared structure

• The lock holder competes with the waiting CPUs as they test-and-set, so the lock holder is slowed down

• Slower lock holder results in more contention

Spinning vs blocking and switching

Uniprocessor

Spinning (busy-waiting) on a lock makes no sense on a uniprocessor

• There is no other running process that could release the lock while we spin

• Blocking and (eventually) switching to the lock holder is the only sensible option.

•On SMP systems, the decision to spin or block is not as clear.

• The lock is held by another running CPU and will be freed without necessarily switching away from the requestor

Multiprocessor

• Blocking and switching to another process takes time

• Save context and restore another

• The cache contains the current process's working set, not the new process's

• Adjusting the cache working set also takes time

• TLB is similar to cache

• Switching back when the lock is free incurs the same overheads again

Spinning wastes CPU time directly

Trade off

• If lock is held for less time than the overhead of switching to and back

=> It's more efficient to spin

Spinlocks expect critical sections to be short

No waiting for I/O within a spinlock

No nesting locks within a spinlock

Preemption and spinlocks

Critical sections synchronised via spinlocks are expected to be short

• Avoid other CPUs wasting cycles spinning

• What happens if the spinlock holder is preempted at the end of its timeslice?

• Mutual exclusion is still guaranteed

• Other CPUs will spin until the holder is scheduled again!

Spinlock implementations disable interrupts in addition to acquiring locks

• avoids lock-holder preemption

• avoids spinning on a uniprocessor

Lec17 Scheduling

Basic

The scheduler decides who to run next.

– The process of choosing is called scheduling.

Scheduling decisions can have a dramatic effect on the perceived performance of the system

– Can also affect correctness of a system with deadlines

Application Behaviour

  1. CPU-bound process
    • Spends most of its time computing
    • Time to completion largely determined by received CPU time
  2. I/O-bound process
    • Spends most of its time waiting for I/O to complete
      • Small bursts of CPU to process I/O and request the next I/O
    • Time to completion largely determined by I/O request time

Observation

• We need a mix of CPU-bound and I/O-bound processes

to keep both CPU and I/O systems busy

• A process can go from CPU-bound to I/O-bound (or vice versa) in different phases of execution

Insight

• Choosing to run an I/O-bound process delays a CPU-bound process by very little

• Choosing to run a CPU-bound process prior to an I/O-bound process delays the next I/O request significantly

– No overlap of I/O waiting with computation

– Results in device (disk) not as busy as possible

=> Generally, favour I/O-bound processes over CPU-bound processes

When is scheduling performed

When a process (or thread) can no longer continue, or when an activity results in more than one ready process.

Preemptive versus Non-preemptive Scheduling

Non-preemptive

– Once a thread is in the running state, it continues until it completes, blocks on I/O, or voluntarily yields the CPU

– A single process can monopolise the entire system

Preemptive

– Current thread can be interrupted by OS and moved to ready state.

– Usually after a timer interrupt and process has exceeded its maximum run time

– Can also be the result of a higher-priority process becoming ready (e.g. after an I/O interrupt)

– Ensures fairer service, as a single thread can't monopolise the system

• Requires a timer interrupt

Categories of scheduling algorithms

• The choice of scheduling algorithm depends on the goals of the application (or the operating system)

– No one algorithm suits all environments

• We can roughly categorise scheduling algorithms as follows

– Batch Systems

• No users directly waiting, can optimise for overall machine performance

– Interactive Systems

• Users directly waiting for their results, can optimise for users' perceived performance

– Realtime Systems

• Jobs have deadlines, must schedule such that all jobs (predictably) meet their deadlines.

Goals of scheduling Algorithms

All algorithms

– Fairness

• Give each process a fair share of the CPU

– Policy Enforcement

• Whatever policy is chosen, the scheduler should ensure it is carried out

– Balance/Efficiency

• Try to keep all parts of the system busy

Interactive Algorithms

Minimise response time

• Response time is the time difference between issuing a command and getting the result

– E.g. selecting a menu item and getting the result of that selection

• Response time is important to the user’s perception of the performance of the system.

Provide Proportionality

• Proportionality is the user expectation that short jobs will have a short response time, and long jobs can have a long response time.

Generally, favour short jobs

Real-time Algorithms

– Must meet deadlines

• Each job/task has a deadline.

• A missed deadline can result in data loss or catastrophic failure

– Aircraft control system missed deadline to apply brakes

– Provide Predictability

• For some apps, an occasional missed deadline is okay

– E.g. DVD decoder

• Predictable behaviour allows smooth DVD playback

Round Robin Scheduling

• Each process is given a timeslice to run in

• When the timeslice expires, the next process preempts the current process, and runs for its timeslice, and so on

– The preempted process is placed at the end of the queue

• Implemented with

– A ready queue

– A regular timer interrupt

Pros

– Fair, easy to implement

Cons

Assumes everybody is equal

Issue: how long should the timeslice be

– Too short

Waste a lot of time switching between processes

• Example: timeslice of 4ms with 1ms context switch = 20% round robin overhead

– Too long

System is not responsive

• Example: timeslice of 100ms

– If 10 people hit “enter” key simultaneously, the last guy to run will only see progress after 1 second.

• Degenerates into FCFS if timeslice longer than burst length
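A toy simulation of the round-robin idea (ready queue plus timer interrupt), where the timer is modelled by decrementing each process's remaining ticks; the workload and names are made up.

```c
#include <stdio.h>

#define NPROC     3
#define TIMESLICE 2     /* ticks per quantum */

int main(void)
{
    int remaining[NPROC] = {5, 3, 7};   /* CPU ticks each process still needs */
    int done = 0;

    /* Cycle through the "ready queue"; a finished process is skipped,
     * a preempted one simply waits for its next turn. */
    for (int cur = 0; done < NPROC; cur = (cur + 1) % NPROC) {
        if (remaining[cur] == 0)
            continue;

        int run = remaining[cur] < TIMESLICE ? remaining[cur] : TIMESLICE;
        remaining[cur] -= run;          /* "timer interrupt" after the slice */
        printf("ran P%d for %d tick(s), %d left\n", cur, run, remaining[cur]);

        if (remaining[cur] == 0)
            done++;
    }
    return 0;
}
```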

Priorities

• Each Process (or thread) is associated with a priority

• Provides basic mechanism to influence a scheduler decision:

The scheduler will always choose a thread of higher priority over one of lower priority

• Priorities can be defined internally or externally

– Internal: e.g. I/O bound or CPU bound

– External: e.g. based on importance to the user

Priority-driven, preemptively scheduled

Usually implemented by multiple priority queues, with round robin on each queue

Con:

Low priorities can starve

• Need to adapt priorities periodically

– Based on ageing or execution history

Traditional UNIX scheduler

Two-level scheduler:

High-level scheduler schedules processes between memory and disk

Low-level scheduler is CPU scheduler

• Based on a multi-level queue structure with round robin at each level

• The highest priority (lowest number) is scheduled

Priorities are re-calculated once per second, and re-inserted in appropriate queue

– Avoid starvation of low priority threads

– Penalise CPU-bound threads

• Priority = CPU_usage + nice + base

CPU_usage = number of clock ticks

Decays over time to avoid permanently penalising the process

Nice is a value given to the process by a user to permanently boost or reduce its priority

• Reduce priority of background jobs

Base is a set of hardwired, negative values used to boost priority of I/O bound system activities

• Swapper, disk I/O, Character I/O
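A sketch of the once-per-second recalculation described above; the halving decay factor and field names are illustrative assumptions, not the exact traditional UNIX formula.

```c
/* Lower numerical priority = scheduled sooner. */
struct proc {
    int cpu_usage;   /* clock ticks accumulated while running          */
    int nice;        /* user-supplied value to boost/penalise priority */
    int base;        /* hardwired (negative for I/O-bound system work) */
    int priority;
};

void recalc_priority(struct proc *p)
{
    p->cpu_usage /= 2;                               /* decay past usage  */
    p->priority = p->cpu_usage + p->nice + p->base;  /* penalise CPU hogs */
    /* The process is then re-inserted into the queue for its new priority. */
}
```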

Multiprocessor scheduling

A single shared ready queue

When a CPU goes idle, it takes the highest priority process from the shared ready queue

Pros

– Simple

– Automatic load balancing

Cons

Lock contention on the ready queue can be a major bottleneck

• Due to frequent scheduling or many CPUs or both

– Not all CPUs are equal

• The last CPU a process ran on is likely to still have related entries in its cache -> running it elsewhere wastes time re-loading the cache

Affinity scheduling

Basic Idea

– Try hard to run a process on the CPU it ran on last time

One approach: Multiple Queue Multiprocessor Scheduling

Multiple Queue SMP Scheduling

Each CPU has its own ready queue

• Coarse-grained algorithm assigns processes to CPUs

– Defines their affinity, and roughly balances the load

• The bottom-level fine-grained scheduler:

– Is the frequently invoked scheduler (e.g. on blocking on I/O, a lock, or exhausting a timeslice)

– Runs on each CPU and selects from its own ready queue

• Ensures affinity

– If nothing is available from the local ready queue, it runs a process from another CPU's ready queue rather than going idle

• Termed "work stealing"

Pros

No lock contention on per-CPU ready queues in the (hopefully) common case

Load balancing to avoid idle queues

– Automatic affinity to a single CPU for more cache friendly behaviour
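A minimal sketch of per-CPU ready queues with work stealing; the types and the steal-in-order policy are assumptions, and a real scheduler would also need per-queue locking.

```c
#include <stddef.h>

#define NCPUS 4

struct thread { struct thread *next; };

struct cpu_rq {
    struct thread *head;    /* this CPU's own ready queue */
};

static struct thread *dequeue(struct cpu_rq *rq)
{
    struct thread *t = rq->head;
    if (t)
        rq->head = t->next;
    return t;
}

/* Pick the next thread for CPU `me`: prefer the local queue (affinity),
 * otherwise steal from another CPU's queue rather than going idle. */
struct thread *pick_next(struct cpu_rq rq[NCPUS], int me)
{
    struct thread *t = dequeue(&rq[me]);
    if (t)
        return t;
    for (int c = 0; c < NCPUS; c++) {
        if (c == me)
            continue;
        if ((t = dequeue(&rq[c])) != NULL)   /* work stealing */
            return t;
    }
    return NULL;            /* nothing runnable anywhere: go idle */
}
```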

Multiple Queue SMP Scheduling is a process scheduling strategy for multiprocessor (SMP, Symmetric Multi-Processing) systems. Each processor has its own set of independent ready queues, and processes are distributed across the processors to achieve better load balancing and parallelism.

Its main characteristics:

  1. Independent process queues: each processor stores and manages the processes waiting to run in its own queues, and schedules and dispatches work from those queues.
  2. Load balancing: when one processor's queue holds noticeably more processes than the others, the system may migrate some processes to other processors' queues to rebalance the load.
  3. Parallelism: running processes on different processors at the same time makes full use of the multiprocessor hardware and improves overall performance and throughput.
  4. Locality: because each processor has its own queue, a process tends to keep running on the same processor, which improves cache hit rates and performance.

There are also challenges and drawbacks, for example in load balancing and synchronisation overhead; to address them, other multiprocessor scheduling strategies have been proposed, such as a single global queue or cluster scheduling.

In short, Multiple Queue SMP Scheduling achieves load balancing and parallelism by giving each processor its own ready queue; despite some challenges it remains an important approach to multiprocessor scheduling.

Cons:

  1. Load balancing is hard: in practice, one or more processors' queues can end up holding many processes while others sit idle, wasting resources and degrading performance.
  2. Synchronisation and communication overhead: processes may need to migrate frequently between processors to rebalance load, which increases synchronisation and communication costs; with many processors this overhead can become significant.
  3. Priority management is harder: with per-processor queues, managing priorities across all the queues is more complex than with a single shared global queue.
  4. Cache consistency issues: when a process migrates to another processor it has to refill that processor's cache, which hurts performance and weakens the locality advantage of per-processor queues.
  5. Scalability: as the number of processors grows, effective load balancing, priority management and synchronisation all become harder.

Despite these drawbacks, Multiple Queue SMP Scheduling is still an effective strategy for multiprocessor systems; alternatives such as a global shared queue or cluster scheduling trade these problems off differently, so the choice depends on the workload.

NOTE: These notes do not include part of Lec 14 and most of Lec 15.