UNSW COMP3231 23T1 notes
Lec 02 Processes and Threads
Processes:
• Also called a task or job
• Memory image of an individual program
• “Owner” of resources allocated for program execution
• Encompasses one or more threads
Threads:
• Unit of execution
• Can be traced
• list the sequence of instructions that execute
• Belongs to a process
• Executes within it.
One Process may contain one or more threads.

Process Termination
Conditions which terminate processes
1. Normal exit (voluntary)
2. Error exit (voluntary)
3. Fatal error (involuntary)
4. Killed by another process (involuntary)
Process and Thread States
Running
Blocked
Ready
Running → Ready
• Voluntary Yield()
• End of timeslice
Running → Blocked
• Waiting for input
• File, network,
• Waiting for a timer (alarm signal)
• Waiting for a resource to become available
Scheduler
Sometimes called the dispatcher; it chooses a ready process to run.
BUT: it is inefficient to search through all processes,
so the OS keeps separate queues, e.g. a Ready Queue and a Blocked Queue.

Thread Model
Per-process items | Per-thread items
---|---
Address space, global variables, open files, child processes, pending alarms, signals and signal handlers, accounting information | Program counter, registers, stack (each thread has its own stack), state
Local variables are per thread (allocated on the stack)
Global variables are shared between all threads (Allocated in the data section)
Concurrency control is an issue.
Dynamically allocated memory can be global or local -> Program defined

Thread Usage
Model | Characteristics
---|---
Threads | Parallelism, blocking system calls
Single-threaded process | No parallelism, blocking system calls
Finite-state machine | Parallelism, non-blocking system calls, interrupts
Summary in Threads
- Simpler to program than a state machine
- Fewer resources associated with threads than with multiple complete processes
- Cheaper to create and destroy
- Share resources (especially memory) between them
- Performance: Threads waiting for IO can be overlapped with computing threads
- If all threads are compute bound, then there is no performance improvement (on a uniprocessor)
- Threads can take advantage of the parallelism available on machines with more than one CPU (multiprocessor)
Lec 03 Concurrency and Synchronisation
Concurrency Example:
Trying to modify a global variable in two threads
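A minimal sketch of the kind of race being described, assuming POSIX threads (the counter and loop bounds are just for illustration):

```c
#include <pthread.h>
#include <stdio.h>

int counter = 0;               /* shared global variable */

void *increment(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        counter = counter + 1; /* load, add, store: not atomic */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 200000, but the two read-modify-write sequences can
       interleave, so the printed value is often less. */
    printf("counter = %d\n", counter);
    return 0;
}
```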

The reasons why there is in-kernel concurrency even for single-threaded processes:
- Multi-tasking
- Interrupt handling
- Device drivers
- System calls
- Resource sharing

Critical Region and how to identify
A critical region is a region of code where shared resources are accessed.
We can control access to the shared resource by controlling access to the code that accesses the resource.
Uncoordinated entry to the critical region results in a race condition (Concurrency Occurs)
HOW TO IDENTIFY:
Critical regions are the regions of code that
- access a shared resource,
- and correctness relies on the shared resource not being concurrently modified by another thread/process/entity.
HOW TO PREVENT:

Critical Regions Solutions
Conditions required of any solution to the critical region problem:
- Mutual exclusion:
- No two processes/threads in the critical region at the same time
- No assumptions made about speeds or numbers of CPUs
- No process running outside its critical region may block another process
- No process waits forever to enter its critical region (bounded waiting)
Mutual exclusion by taking turns
Works for solving concurrency issues: strict alternation, where each process takes turns.
Cons
- Busy waiting
- Process must wait its turn even while the other process is doing something else
- In a multiple processes condition, must wait for everyone to have a turn
- Does not guarantee progress if a process no longer needs a turn
- Poor solution when processes require the critical section at different rates
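A sketch of strict alternation for two processes/threads using a shared turn variable (the critical and non-critical regions are placeholder comments; the busy-waiting is exactly the weakness listed above):

```c
volatile int turn = 0;            /* whose turn it is: 0 or 1 */

void process(int self)            /* self is 0 or 1 */
{
    for (;;) {
        while (turn != self)
            ;                     /* busy wait until it is our turn */
        /* ... critical region: access the shared resource ... */
        turn = 1 - self;          /* hand the turn to the other process */
        /* ... non-critical region: if this takes a long time, the
           other process is stuck busy-waiting for its turn ... */
    }
}
```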
Mutual Exclusion by disabling interrupts
How it works:
Before entering a critical region, disable interrupts.
After leaving the critical region, enable interrupts.
Cons:
- Only available in the kernel
- Delays everybody else, even with no contention
- Slows the interrupt response time
- Does not work on a multiprocessor
Hardware Support for mutual exclusion
Test-and-Set instruction
• Can be used to implement lock variables correctly
• Load the value of the lock
• If lock == 0
• Set the lock to 1
• Return 0 -> we have acquired the lock
• If lock == 1
• Return 1 -> another thread/process holds the lock
• Hardware guarantees that the instruction executes atomically -> it cannot be interrupted
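A sketch of a spinlock built on an atomic test-and-set, here approximated with C11's atomic_flag (the hardware instruction itself is architecture-specific):

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* clear = free, set = held */

void acquire(void)
{
    /* atomic_flag_test_and_set atomically sets the flag and returns its
       previous value, just like the hardware instruction described above */
    while (atomic_flag_test_and_set(&lock))
        ;                                     /* spin (busy wait) while held */
}

void release(void)
{
    atomic_flag_clear(&lock);                 /* set the lock back to free */
}
```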

Pros:
- Simple (easy to verify that it works)
- Available at user level
- Works with any number of processors
- Can implement any number of lock variables
Cons:
- Busy waits (also known as a spin lock)
- Consumes CPU
- Starvation is possible when a process leaves its critical section and more than one process is waiting.
Tackling the busy-wait problem:
Sleep/Wakeup
- When the process is waiting for an event, it calls sleep to block, instead of busy waiting.
- When the event happens, the event generator (another process) calls wakeup to unblock the sleeping process.
- Waking a ready/running process has no effect.
Producer-Consumer Problem(bounded buffer problem)
Description: the producer produces data items and stores them in a buffer.
The consumer takes the items out of the buffer and consumes them.

Issues:
Producer
Should sleep when the buffer is full,
and be woken up when there is empty space in the buffer.
The consumer can call wakeup when it consumes the first entry of a full buffer.
Consumer
Should sleep when the buffer is empty,
and be woken up when there are items available.
The producer can call wakeup when it adds the first item to an empty buffer.
Semaphores
Two primitives that are more powerful than simple sleep and wakeup alone.
• P(): proberen, from Dutch to test.
• V(): verhogen, from Dutch to increment.
• Also called wait & signal, down & up.
How it works:
If a resource is not available, the corresponding semaphore blocks any process waiting for the resource.
Blocked processes are put into a process queue maintained by the semaphore (avoids busy waiting!)
When a process releases a resource, it signals this by means of the semaphore.
Signaling resumes a blocked process if there is any.
Wait (P) and signal (V) operations cannot be interrupted.
Complex coordination can be implemented by multiple semaphores.


The initial count determines how many waits can progress before blocking and requiring a signal.
Producer Consumer problem with semaphores:
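A runnable sketch of the bounded-buffer solution using three semaphores. It uses POSIX semaphores, where sem_wait corresponds to P/wait and sem_post to V/signal; the buffer size and item counts are arbitrary choices for illustration:

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 8                      /* number of buffer slots */

static int buffer[N];
static int in = 0, out = 0;      /* insert/remove positions */

static sem_t empty;              /* counts empty slots, starts at N */
static sem_t full;               /* counts full slots,  starts at 0 */
static sem_t mutex;              /* binary semaphore guarding the buffer */

static void *producer(void *arg)
{
    for (int item = 0; item < 32; item++) {
        sem_wait(&empty);        /* P(empty): wait for a free slot */
        sem_wait(&mutex);        /* P(mutex): enter critical region */
        buffer[in] = item;
        in = (in + 1) % N;
        sem_post(&mutex);        /* V(mutex) */
        sem_post(&full);         /* V(full): one more item available */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    for (int i = 0; i < 32; i++) {
        sem_wait(&full);         /* P(full): wait for an item */
        sem_wait(&mutex);
        int item = buffer[out];
        out = (out + 1) % N;
        sem_post(&mutex);
        sem_post(&empty);        /* V(empty): slot is free again */
        printf("consumed %d\n", item);
    }
    return NULL;
}

int main(void)
{
    sem_init(&empty, 0, N);
    sem_init(&full, 0, 0);
    sem_init(&mutex, 0, 1);
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

The initial counts illustrate the point above: empty starts at N, so N producer waits can progress before a signal is needed; mutex starts at 1, making it behave like a lock.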


Summarising Semaphores
Semaphores can be used to solve a variety of concurrency problems.
However, programming with them can be error-prone:
E.g. must signal for every wait
Too many or too few signals or waits, or signals and waits in the wrong order, can have bad results.
Monitor
A higher level synchronisation primitive.
Programming language construct
IDEA:
A set of procedures/variables/data types are grouped in a special kind of module: a monitor.
Variables and data types can only be accessed from within the monitor.
Only one process/thread can be in the monitor at any one time.
Mutual exclusion is implemented by the compiler.

When a thread calls a monitor procedure while another thread is already inside, it is placed on the entry queue and sleeps until the current thread exits the monitor.

Condition Variable
To allow a process to wait within the monitor, we need to declare a condition variable.
Condition variable can only be used with the operations wait and signal.
x.wait()
The process invoking this operation is suspended until another process invokes x.signal().
Another thread can enter the monitor while the original is suspended.
x.signal()
This operation resumes exactly one suspended process.
If no process is suspended, the signal operation has no effect.

How to achieve producer-consumer problem with monitors is in P50 lec03.
Synchronisation Primitives in OS/161
Locks

Semaphores

Wait (P) / signal (V)
Condition Variables

Note: the caller of all three variants (wait, signal, broadcast) must hold the lock passed in.
Condition Variables and Bounded Buffers

cv_wait() will release the lock before blocking the process (and re-acquire it before returning).
Producer-consumer solution with condition variable
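A sketch in the style of OS/161's lock and condition-variable interface. item_t, buffer_full(), buffer_empty(), insert_item() and remove_item() are assumed placeholders for an underlying bounded buffer, not real OS/161 functions; the lock and CVs are assumed to have been created with lock_create()/cv_create():

```c
struct lock *buf_lock;
struct cv   *cv_notfull;    /* signalled when a slot becomes free */
struct cv   *cv_notempty;   /* signalled when an item is inserted */

void producer(item_t item)
{
    lock_acquire(buf_lock);
    while (buffer_full())
        cv_wait(cv_notfull, buf_lock);   /* releases buf_lock while blocked,
                                            re-acquires it before returning */
    insert_item(item);
    cv_signal(cv_notempty, buf_lock);    /* wake one waiting consumer */
    lock_release(buf_lock);
}

item_t consumer(void)
{
    lock_acquire(buf_lock);
    while (buffer_empty())
        cv_wait(cv_notempty, buf_lock);
    item_t item = remove_item();
    cv_signal(cv_notfull, buf_lock);     /* wake one waiting producer */
    lock_release(buf_lock);
    return item;
}
```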

Dining Philosophers
The dining philosophers problem is a classic computer-science synchronisation problem, posed by Edsger Dijkstra in 1965. Five philosophers sit around a round table, alternating between thinking and eating. In the middle of the table is a bowl of spaghetti, and between every two philosophers there is a single chopstick. Each philosopher needs two chopsticks in order to eat. The philosophers cannot talk to each other; they can only think or eat.
The problem is to design a protocol that lets every philosopher use two chopsticks to eat when they need to, without deadlock (all philosophers waiting for chopsticks so that nobody can continue eating) and without starvation (some philosopher being unable to obtain chopsticks for a long time).
One possible solution:
Assign a number to each chopstick (e.g. 1 to 5) and to each philosopher. When a philosopher wants to eat, they first try to pick up the lower-numbered of their two chopsticks, and only then the higher-numbered one. Once a philosopher holds both chopsticks they can start eating. After eating, they put the chopsticks back and return to thinking.
This approach avoids deadlock, because at least one philosopher (the one holding the lowest-numbered chopstick) can pick up both chopsticks and start eating, while the other philosophers wait for chopsticks to be put back. Although some philosophers may wait longer than others, every philosopher eventually gets a chance to eat, so starvation is avoided.
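A sketch of the resource-ordering solution just described, using one POSIX mutex per chopstick (initialisation and the think/eat loop are omitted):

```c
#include <pthread.h>

#define N 5
static pthread_mutex_t chopstick[N];   /* initialise each with pthread_mutex_init() */

static void take_chopsticks(int phil)
{
    int left  = phil;
    int right = (phil + 1) % N;
    int lo = left < right ? left : right;    /* always acquire the lower-  */
    int hi = left < right ? right : left;    /* numbered chopstick first   */
    pthread_mutex_lock(&chopstick[lo]);
    pthread_mutex_lock(&chopstick[hi]);
    /* ... eat ... */
}

static void put_chopsticks(int phil)
{
    pthread_mutex_unlock(&chopstick[phil]);
    pthread_mutex_unlock(&chopstick[(phil + 1) % N]);
}
```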
Reader and Writer Problem
The readers-writers problem is a classic concurrency-control problem describing a scenario where multiple processes must synchronise their access to a shared resource. There are reader processes and writer processes: readers only read the shared resource, while writers may modify it. The challenge is to design a synchronisation protocol that allows multiple readers to access the shared resource simultaneously, but ensures that while a writer is accessing the resource no other process (reader or writer) can access it.
One possible solution:
We can use a mutex and a semaphore to implement this protocol. The mutex protects the counter of active readers, and the semaphore controls access to the shared resource itself.
Reader process:
- Acquire the mutex in order to modify the access counter.
- Increment the counter. If this is the first reader, wait on the semaphore to block writers from accessing the shared resource.
- Release the mutex.
- Access the shared resource.
- Acquire the mutex in order to modify the access counter.
- Decrement the counter. If this is the last reader, signal the semaphore to allow writers to access the shared resource.
- Release the mutex.
Writer process:
- Wait on the semaphore to block other processes from accessing the shared resource.
- Access the shared resource.
- Signal the semaphore to allow other processes to access the shared resource.
This solution allows multiple readers to access the shared resource at the same time, but when a writer needs access it must wait until all current readers have finished. Likewise, while a writer is accessing the shared resource, other readers and writers are blocked. This ensures that no other process can access the shared resource while a writer is using it.
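A sketch of the readers-preference protocol described above, using a POSIX mutex for the reader count and a semaphore for the resource (the semaphore must be initialised to 1 with sem_init; the actual resource access is left as comments):

```c
#include <pthread.h>
#include <semaphore.h>

static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER; /* guards read_count */
static sem_t resource;            /* sem_init(&resource, 0, 1): held while anyone
                                     writes, or while at least one reader reads */
static int read_count = 0;

static void reader(void)
{
    pthread_mutex_lock(&count_lock);
    if (++read_count == 1)
        sem_wait(&resource);      /* first reader locks out writers */
    pthread_mutex_unlock(&count_lock);

    /* ... read the shared resource ... */

    pthread_mutex_lock(&count_lock);
    if (--read_count == 0)
        sem_post(&resource);      /* last reader lets writers in */
    pthread_mutex_unlock(&count_lock);
}

static void writer(void)
{
    sem_wait(&resource);          /* exclusive access */
    /* ... modify the shared resource ... */
    sem_post(&resource);
}
```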
Lec04 Deadlock
Deadlock & Resources
Deadlocks occur when
- Processes are granted exclusive access to devices/locks/tables..
- We refer to these entities generally as resources.
Deadlock: definition
A set of processes is deadlocked if each process in the set is waiting for an event that only another process in the set can cause.
In the deadlock situation, no process can
- Run
- Release resources
- Be awakened
Four conditions for deadlock
- Mutual exclusion condition
- Each resource assigned to 1 process or is available
- Hold and wait condition
- Process holding resources can request additional
- No preemption condition
- Previously granted resources cannot be forcibly taken away
- Circular wait condition
- Must be a circular chain of 2 or more processes
- Each is waiting for resource held by next member of the chain
Strategies for dealing with deadlock
- Just ignore the problem altogether
- Prevention
- Negating one of the four necessary conditions
- Detection and recovery
- Dynamic avoidance
- Careful resource allocation
Method 1: Ostrich Algorithm
Pretend there is no problem
Reasonable if
- Deadlocks occur very rarely
- Cost of prevention is high
UNIX and Windows take this approach for some of the more complex resource relationships they manage
It’s a trade off between
- Convenience (engineering approach)
- Correctness (mathematical approach)
Method 2: Deadlock prevention
Resource allocation rules prevent deadlock by preventing one of the four conditions required for deadlock from occurring:
- Mutual exclusion
- Hold and wait
- No preemption
- Circular Wait
Attacking the Mutual Exclusion Condition (Not feasible)
Not feasible in general
Some devices/resources are intrinsically not sharable.
Attacking the hold and wait condition (request resources initially)
Require processes to request resources before starting
A process never has to wait for what it needs
Issues:
- May not know required resources at start of run
Which means not always possible
- Also ties up resources other processes could be using
Variations:
- Process must give up all resources if it would block holding a resource
- Then request all resources it immediately needs
- Prone to livelock
Attacking the No Preemption condition (take resources away)
Not a viable option
Attacking the circular wait condition (Order resources)
Numerically ordered resources (Resources ordering is a common technique)
Method 3: Detection and recovery
Need a method to determine if a system is deadlocked.
Assuming deadlock is detected, we need a method of recovery to restore progress to the system.
Strategy:
Note the resource ownership and requests
A cycle can be found within the graph, which denotes a deadlock
Detection with Multiple Resources of Each Type

Sum of current resource allocation + resources available = resources that exist

Algorithm
1. Look for an unmarked process Pi for which the i-th row of R is less than or equal to A.
2. If found, add the i-th row of C to A, mark Pi, and go back to step 1.
3. If no such process exists, terminate.
The remaining (unmarked) processes are deadlocked.
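A sketch of this detection algorithm in C, for a fixed number of processes and resource types (C = current allocation, R = outstanding requests, A = available, following the matrices above; note the function modifies A as it marks processes):

```c
#include <stdbool.h>

#define NPROC 3
#define NRES  4

/* Returns true and fills deadlocked[] for processes that can never finish. */
bool detect_deadlock(int C[NPROC][NRES],   /* current allocation */
                     int R[NPROC][NRES],   /* outstanding requests */
                     int A[NRES],          /* available resources (modified) */
                     bool deadlocked[NPROC])
{
    bool marked[NPROC] = { false };
    bool progress = true;

    while (progress) {
        progress = false;
        for (int i = 0; i < NPROC; i++) {
            if (marked[i])
                continue;
            /* Step 1: is row i of R <= A? */
            bool can_run = true;
            for (int j = 0; j < NRES; j++)
                if (R[i][j] > A[j]) { can_run = false; break; }
            if (can_run) {
                /* Step 2: assume Pi runs to completion and releases
                   everything it holds: add row i of C to A, mark Pi. */
                for (int j = 0; j < NRES; j++)
                    A[j] += C[i][j];
                marked[i] = true;
                progress = true;
            }
        }
    }
    /* Step 3: any unmarked process is deadlocked. */
    bool any = false;
    for (int i = 0; i < NPROC; i++) {
        deadlocked[i] = !marked[i];
        if (deadlocked[i]) any = true;
    }
    return any;
}
```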
Recovery from Deadlock
Recovery through preemption
• take a resource from some other process
• depends on nature of the resource
Recovery through rollback
• checkpoint a process periodically
• use this saved state
• restart the process if it is found deadlocked
• No guarantee it won't deadlock again
Recovery through killing processes
• crudest but simplest way to break a deadlock
• kill one of the processes in the deadlock cycle
• the other processes get its resources
• choose process that can be rerun from the beginning
Method 4: Deadlock Avoidance
Only if enough information is available in advance can we avoid deadlock:
- Maximum number of each resource required (by each process)

Safe and Unsafe states
A state is safe if
• The system is not deadlocked
• There exists a scheduling order that results in every process running to completion, even if they all request their maximum resources immediately
Unsafe states are not necessarily deadlocked
• With a lucky sequence, all processes may complete
• However, we cannot guarantee that they will complete (not deadlock)
• Safe states guarantee we will eventually complete all processes
Deadlock avoidance algorithm
• Only grant requests that result in safe states
LiveLock
Livelocked processes are not blocked, change state regularly, but never make progress

Bankers Algorithm
Modelled on a Banker with Customers
• The banker has a limited amount of money to loan customers
• Limited number of resources
• Each customer can borrow money up to the customer’s credit limit
• Maximum number of resources required
Basic Idea
• Keep the bank in a safe state
• So all customers are happy even if they all request to borrow up to their credit limit at the same time
• Customers wishing to borrow such that the bank would enter an unsafe state must wait until somebody else repays their loan such that the transaction becomes safe
Starvation
A process never receives the resource it is waiting for: although the resource (repeatedly) becomes free, it is always allocated to another waiting process.
One Solution: First come, first served
Lec05 Processes and Threads Implementation
MIPS Registers
User-mode accessible registers
• 32 general purpose registers
• r0 hardwired to zero
• r31 the link register for jump-and-link
(JAL) instruction
HI/LO
• 2 * 32-bits for multiply and divide
PC
• Not directly visible
• Modified implicitly by jump and branch instructions
Branching and Jumping
Branching and jumping have a branch delay slot
• The instruction following a branch or jump is always executed prior to the destination of the jump
Process
Memory allocation
Minimally consist of three segments
• Text
• contains the code (instructions)
• Data
• Global variables
• Stack
• Activation records of procedure/function/method
• Local variables
Note:
• data can dynamically grow up
• E.g., malloc()-ing
• The stack can dynamically grow down
• E.g., increasing function call depth or recursion

Mode distinguish
User-mode
• Processes (programs) scheduled by the kernel
• Isolated from each other
• No concurrency issues between each other
System-calls transition into and return from the kernel
Kernel-mode
• Nearly all activities still associated with a process
• Kernel memory shared between all processes
• Concurrency issues exist between processes concurrently executing in a system call
User-level Threads

Implementation at user-level
• User-level Thread Control Block (TCB), ready queue, blocked queue, and dispatcher
• Kernel has no knowledge of the threads (it only sees a single process)
• If a thread blocks waiting for a resource held by another thread inside
the same process, its state is saved and the dispatcher switches to another ready thread
• Thread management (create, exit, yield, wait) are implemented in a runtime support library
Pros
• Thread management and switching at user level is much faster than doing it in kernel level
• No need to trap (take syscall exception) into kernel and back to switch
• Dispatcher algorithm can be tuned to the application
• E.g. use priorities
• Can be implemented on any OS (thread or non-thread aware)
• Can easily support massive numbers of threads on a per-application
basis
• Use normal application virtual memory
• Kernel memory more constrained. Difficult to efficiently support wildly differing numbers of threads for different applications
Cons
• Threads have to yield() manually (no timer interrupt delivery to user level)
• Co-operative multithreading
• A single poorly designed/implemented thread can monopolise the available CPU time
• There are work-arounds (e.g. a timer signal per second to enable pre-emptive multithreading), but they are coarse grained and a kludge.
• Does not take advantage of multiple CPUs (in reality, we still have a single threaded process as far as the kernel is concerned)
• If a thread makes a blocking system call (or takes a page fault), the process (and all the internal threads) blocks
• Can’t overlap I/O with computation
Kernel-provided Threads

Threads are implemented by the kernel
• TCBs (thread control block) are stored in the kernel
• A subset of information in a traditional PCB (process control block)
• The subset related to execution context
• TCBs have a PCB associated with them
• Resources associated with the group of threads (the process)
• Thread management calls are implemented as system calls
• E.g. create, wait, exit
Cons
• Thread creation and destruction, and blocking and unblocking threads requires kernel entry and exit.
• More expensive than user-level equivalent
Pros
• Preemptive multithreading
• Parallelism
• Can overlap blocking I/O with computation
• Can take advantage of a multiprocessor
Context Switch
Context Switch is what the lowest level of OS does when an interrupt occurs.
A context switch can refer to
• A switch between threads
• Involving saving and restoring of state associated with a thread
• A switch between processes
• Involving the above, plus extra state associated with a process.
• E.g. memory maps
Context Switch Occurrence
A switch between process/threads can happen any time the OS is invoked
• On a system call
• Mandatory if system call blocks or on exit();
• On an exception
• Mandatory if offender is killed
• On an interrupt
• Triggering a dispatch is the main purpose of the timer interrupt
A thread switch can happen between any two instructions
Note instructions do not equal program statements
Context Switch Addition
- Context switch must be transparent for processes/threads
• When dispatched again, process/thread should not notice that something else was running in the meantime (except for elapsed time)
- OS must save all state that affects the thread
• This state is called the process/thread context
• Switching between process/threads consequently
results in a context switch.
Lec06 Syscalls
Syscall Definition
Can be viewed as special function calls
• Provides for a controlled entry into the kernel
• While in kernel, they perform a privileged operation
• Returns to original caller with the result
The system call interface represents the abstract machine provided by the operating system.
CPU Computation Model
The fetch-execute cycle
• Load memory contents from address in program counter (PC)
• The instruction
• Execute the instruction
• Increment PC
• Repeat

Privileged-mode operation
To protect operating system execution, two or more CPU modes of operation exist
• Privileged mode (system-, kernel-mode)
• All instructions and registers are available
• User-mode
• Uses ‘safe’ subset of the instruction set
• Only affects the state of the application itself
• They cannot be used to uncontrollably interfere with OS
• Only ‘safe’ registers are accessible
The accessibility of addresses within an address space changes depending on operating mode
• To protect kernel code and data
• Note: The exact memory ranges are usually configurable, and vary between CPU architectures and/or operating systems.

System call mechanism securely transfers from user execution to kernel execution and back.
System call mechanism overview
• Processor mode
• Switched from user-mode to kernel-mode
• Switched back when returning to user mode
• Stack Pointer (SP)
• User-level SP is saved and a kernel SP is initialised
• User-level SP restored when returning to user-mode
• Program Counter (PC)
• User-level PC is saved and PC set to kernel entry point
• User-level PC restored when returning to user-level
• Kernel entry via the designated entry point must be strictly enforced
•Registers
• Set at user-level to indicate system call type and its arguments
• A convention between applications and the kernel
• Some registers are preserved at user-level or kernel-level in order to restart user-level execution
• Depends on language calling convention etc.
• Result of system call placed in registers when returning to user-level
• Another convention
We need system calls because function calls do not:
- Change from user to kernel mode/back again
- Restrict possible entry points to secure locations
Coprocessor 0 (CP0)
The processor control registers are located in CP0
- Exception/Interrupt management registers
- Translation management registers
Exception management

c0_status
We only focus on bits 0-15

c0_cause
Bits 2-6 hold the ExcCode -> the code number of the exception taken
c0_epc
The Exception Program Counter
• Points to address of where to restart execution after handling the exception or interrupt
Hardware Exception handling
Basic situation: assume an interrupt occurred as the previous instruction completed, and we are in user mode with interrupts enabled.
Steps:
- The instruction address at which to restart after the interrupt is transferred to EPC
- Interrupts are disabled and the previous state is shifted along
- What was stored in IEc/KUc is now in IEp/KUp
- Kernel mode is set, and the previous state is shifted along
- What was stored in IEp/KUp (before the previous step executed) is now in IEo/KUo
- The code for the exception is placed in Cause -> ExcCode in c0_cause is now 0 (0 is the code for interrupt)
- The address of the general exception vector is placed in the PC
- The CPU is now running in kernel mode at the address in PC, with interrupts disabled
- All the information required to find out what caused the exception and to restart after handling it is in the coprocessor registers
- (Ignoring how the OS handles the exception itself)
- To return: load the contents of EPC and store it back into the PC
- In the branch delay slot (of the jump back to the PC/user mode), execute a restore-from-exception instruction -> shifts KU/IE back
- Now we are back in the same state we were in when the exception happened
MIPS System Calls
• System calls are invoked via a syscall instruction.
• The syscall instruction causes an exception and transfers control to the general exception handler
• A convention (an agreement between the kernel and
applications) is required as to how user-level software
indicates
• Which system call is required
• Where its arguments are
• Where the result should go
OS/161 Systems Calls
OS/161 uses the following conventions:
• Arguments are passed and returned via the normal C function calling convention
• Additionally
• Reg v0 contains the system call number
• On return, reg a3 contains
• 0: success; v0 contains the successful result
• not 0: failure; v0 holds the error number
• the user-level stub then stores v0 into errno
• and returns -1 in v0 to the caller
Summary of System Call in User Mode
• From the caller’s perspective, the read() system call behaves like a normal function call
• It preserves the calling convention of the language
•However, the actual function implements its own convention by agreement with the kernel
• Our OS/161 example assumes the kernel preserves appropriate registers(s0-s8, sp, gp, ra).
•Most languages have similar libraries that interface with the operating system.
System Calls – Kernel Side
• Things left to do
• Change to kernel stack
• Preserve registers by saving to memory (on the kernel stack)
• Leave saved registers somewhere accessible to
• Read arguments
• Store return values
• Do the “read()”
• Restore registers
• Switch back to user stack
• Return to application
Lec07 Computer Hardware Review
Memory Hierarchy

Caching
• Given two levels of data storage: small and fast, versus large and slow,
• Can speed access to slower storage by using intermediate-speed storage as a cache.
CPU cache

• CPU cache is fast memory placed between the CPU and main memory
• 1 to a few cycles access time compared to RAM access time of tens – hundreds of cycles
• Holds recently used data or instructions to save memory accesses.
• Matches slow RAM access time to CPU speed if high hit rate
• Is hardware maintained and (mostly) transparent to software
• Sizes range from few kB to tens of MB.
• Usually a hierarchy of caches (2–5 levels), on- and off-chip
Performance
The performance depends on the hit rate in the first level.
Effective Access Time
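The usual two-level formulation is a weighted average of the two access times (a sketch; here H is the first-level hit rate, T1 the first-level access time, and T2 the access time when the first level misses):

$$T_{\text{eff}} = H \times T_1 + (1 - H) \times T_2$$

For example, with T1 = 1 cycle, T2 = 100 cycles and H = 0.95: Teff = 0.95 × 1 + 0.05 × 100 = 5.95 cycles, so even a small miss rate dominates the average access time.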

Avoid Waiting for Disk Access
• Keep a subset of the disk’s data in main memory
- OS uses main memory as a cache of disk contents
Avoid Waiting for Internet Access
• Keep a subset of the Internet’s data on disk
- Application uses disk as a cache of the internet
Lec08 File Management
File Names
File system must provide a convenient naming scheme
• Textual Names
• May have restrictions
• Only certain characters
• E.g. no ‘/’ characters
• Limited length
• Only certain format
• E.g DOS, 8 + 3
• Case (in)sensitive
• Names may obey conventions (.c files for C files)
• Interpreted by tools (e.g. UNIX)
• Interpreted by operating system (e.g. Windows “con:”)
File Structure
• Sequence of Bytes
• OS considers a file to be unstructured
• Applications can impose their own structure
• Used by UNIX, Windows, most modern OSes

File Types
• Regular files
•Directories
•Device Files
–May be divided into
•Character Devices – stream of bytes
•Block Devices
•Some systems distinguish between regular file types
–ASCII text files, binary files
File Access Types (Patterns)
•Sequential access
–read all bytes/records from the beginning
–cannot jump around, could rewind or back up
–convenient when medium was magnetic tape
•Random access
–bytes/records read in any order
–essential for database systems
–a read can be done in two ways:
•move the file pointer (seek), then read
–lseek(location,…); read(…)
•or each read specifies the file pointer (location)
–read(location,…)
Typical File Operations
- Create
- Delete
- Open
- Close
- Read
- Write
- Append
- Seek
- Get attributes
- Set Attribute
- Rename
Criteria for File Organization
Things to consider when designing file layout
•Rapid access
–Needed when accessing a single record
–Not needed for batch mode
•read from start to finish
•Ease of update
–File on CD-ROM will not be updated, so this is not a concern
•Economy of storage
–There should be minimal redundancy in the data
–However, some redundancy can be used to speed access, such as an index
File Directories
•Provide mapping between file names and the files themselves
•Contain information about files
–Attributes
–Location
–Ownership
•Directory itself is a file owned by the operating system
Hierarchical (Tree Structured) Directory
•Files can be located by following a path from the root, or master, directory down various branches
–This is the absolute pathname for the file
•Can have several files with the same file name, as long as they have unique path names (file names within the same directory must be unique)
Relative and Absolute Pathnames
•Absolute pathname
–A path specified from the root of the file system to the file
A Relative pathname
–A pathname specified from the cwd (current working directory)
Typical Directory Operations
- Create
- Delete
- Opendir
- Closedir
- Readdir
- Rename
- Link
- Unlink
Nice Properties of UNIX naming
•Simple, regular format
–Names referring to different servers, objects, etc., have the same syntax.
•Regular tools can be used where specialised tools would otherwise be needed.
•Location independent
–Objects can be distributed or migrated, and continue with the same names
Access Rights
In a multiuser system, we need to consider the issue of access rights.
Different kinds of Access Rights:
- None
- User may not know the existence of the file
- User is not allowed to read the directory that includes the file
- Knowledge
- User can only determine that the file exists and who its owner is
- Execution
- The user can load and execute a program but cannot copy it
- Reading
- The user can read the file for any purpose, including copying and execution
- Appending
- The user can add data to the file but cannot modify or delete any of the file’s contents
- Updating
- The user can modify, delete, and add to the file’s data. This includes creating the file, rewriting it, and removing all or part of the data
- Changing protection
- User can change access rights granted to other users
- Deletion
- User can delete the file
- Owners
- Has all rights previously listed
- May grant rights to others using the following classes of users:
  • Specific user
  • User groups
  • All (for public files)
Simultaneous Access
In a multiuser system, we also need to consider the issue of simultaneous access.
•Most OSes provide mechanisms for users to manage concurrent access to files
–Example: flock(), lockf(), system calls
•Typically
–User may lock entire file when it is to be updated
–User may lock the individual records (i.e. ranges) during the update
•Mutual exclusion and deadlock are issues for shared access
Lec09 File System Internals
UNIX Storage Stack
From lower to Higher:
- Hard disk platters:
- Tracks
- Sectors
- Disk controller
- Hides disk geometry, bad sectors
- Exposes linear sequence of blocks(disk’s interface)
- Device driver
- Hides device-specific protocol
- Exposes block-device interface
- Disk scheduler/buffer cache -> Optimisations
- Keep recently accessed disk blocks in memory
- Schedule disk accesses from multiple processes for performance and fairness
- File System
- Hides physical location of data on the disk
- Exposes:
- Directory hierarchy
- Symbolic file names
- Random-access files
- Protection
- Virtual File System (VFS)
- Unified interface to multiple File systems
- Open File Table (OF table)/File Descriptor Table (FD table)
- Keep track of files opened by user-level processes
- Matches syscall interface to VFS interface

Difference between some FS
Different physical nature of storage devices
– Ext3 is optimised for magnetic disks
– JFFS2 is optimised for flash memory devices
– ISO9660 is optimised for CDROM
Different storage capacities
– FAT16 does not support drives >2GB
– FAT32 becomes inefficient on drives >32GB
– ZFS, Btrfs is designed to scale to multi-TB disk arrays
Different CPU and memory requirements
– FAT16 is not suitable for modern PCs but is a good fit for many embedded devices
Proprietary standards
– NTFS may be a nice FS, but its specification is closed
Property of hard disk
– Seek time
• ~15ms worst case
– Rotational delay
• 8ms worst case for 7200rpm drive
– For comparison, disk-to-buffer transfer speed of a modern drive is ~10µs per 4K block
Conclusion: keep blocks that are likely to be accessed together close to each other
Implementing a file system
Requirements
The FS must map symbolic file names into a collection of block addresses.
The FS must keep track of
– which blocks belong to which files.
– in what order the blocks form the file
– which blocks are free for allocation
Given a logical region of a file, the FS must track the corresponding block(s) on disk.

File Allocation Methods
Contiguous Allocation
✔ Easy bookkeeping (need to keep track of the starting block and length of the file)
✔ Increases performance for sequential operations
✗ Need the maximum size for the file at the time of creation
✗ As files are deleted, free space becomes divided into many small chunks (external fragmentation)
Example: ISO 9660 (CDROM)

Dynamic Allocation Strategies
– Disk space allocated in portions as needed
– Allocation occurs in fixed-size blocks
✔ No external fragmentation
✔ Does not require pre-allocating disk space
✗ Partially filled blocks (internal fragmentation)
✗ File blocks are scattered across the disk
✗ Complex metadata management (maintain the collection of blocks
for each file)

Examples come after:
External and Internal fragmentation
External fragmentation
– The space wasted external to the allocated memory regions
– Memory space exists to satisfy a request but it is unusable as it is not contiguous
Internal fragmentation
– The space wasted internal to the allocated memory regions
– Allocated memory may be slightly larger than requested memory; this size difference is wasted memory internal to a partition
Conclusion
Internal fragmentation is inside the allocated memory blocks
External fragmentation is between two allocated memory blocks
Dynamic allocation : Linked list allocation
Each block contains the block number of the next block in the chain. Free blocks are also linked in a chain.
✔ Only single metadata entry per file
✔ Best for sequentially accessed files
✗ Poor for random access
✗ Blocks end up scattered across the disk due to free list eventually being randomised

Dynamic Allocation: File allocation Table (FAT)
• Keep a map of the entire FS in a separate table
– A table entry contains the number of the next block of the file
– The last block in a file and empty blocks are marked using reserved values
• The table is stored on the disk and is replicated in memory
• Random access is fast (following the in-memory list)
Issues:
Requires a lot of memory for large disks
File allocation table disk layout

Dynamical Allocation: inode-based FS structure
Idea: separate table (index-node or i-node) for each file.
– Only keep table for open files in memory
– Fast random access
The most popular FS structure today

Issues:
- i-nodes occupy one or several disk areas
- i-nodes are allocated dynamically, hence free-space management is required for i-nodes
- – Use fixed-size i-nodes to simplify dynamic allocation
- – Reserve the last i-node entry for a pointer (a block number) to an extension i-node

Free-space management
- Approach 1: Linked list of free blocks, stored in the free blocks themselves on disk
- Approach 2: Keep bitmaps of free blocks and free i-nodes on disk

Free block List
List of all unallocated blocks
Background jobs can re-order list for better contiguity
Store in free blocks themselves
– Does not reduce disk capacity
Only one block of pointers needs to be kept in main memory
Bit tables
Individual bits in a bit vector flag used/free blocks
16GB disk with 512-byte blocks –> 4MB table
May be too large to hold in main memory
Expensive to search
– Optimisations possible, e.g. a two level table
Concentrating (de)allocations in a portion of the bitmap has desirable effect of concentrating access
Simple to find contiguous free space
Implementing directories
• Directories are stored like normal files
– directory entries are contained inside data blocks
• The FS assigns special meaning to the content of these files
– a directory file is a list of directory entries
– a directory entry contains file name, attributes, and the file i-node number
• maps human-oriented file name to a system-oriented name
Fixed-size directory entries
– Either too small
• Example: DOS 8+3 characters
– Or waste too much space
• Example: 255 characters per file name
Variable-size directory entries
– Freeing variable length entries can create external fragmentation in directory blocks
• Can compact when block is in RAM
Searching Directory methods
- Linear scan
- Implement a directory cache in software to speed-up search
- Hash lookup
- B-tree
Storing file attributes
- Disk addresses and attributes in directory entry -> FAT
- Directory in which each entry just refers to an i-node -> UNIX

File system block size
File systems deal with 2 types of blocks
– Disk blocks or sectors (usually 512 bytes)
– File system blocks 512 * 2^N bytes
The question is what N should be.
• Smaller blocks waste less disk space (less internal fragmentation)
• Sequential Access
– The larger the block size, the fewer I/O operations required
• Random Access
– The larger the block size, the more unrelated data loaded.
– Spatial locality of access improves the situation
• Choosing an appropriate block size is a compromise
Lec10 UNIX File Management (continue)
Virtual file system (VFS)
Traversing the directory hierarchy may require VFS to issue requests to several underlying file systems.
• Provides single system call interface for many file systems
– E.g., UFS, Ext2, XFS, DOS, ISO9660,…
• Transparent handling of network file systems
– E.g., NFS, AFS, CODA
• File-based interface to arbitrary device drivers (/dev)
• File-based interface to kernel data structures (/proc)
• Provides an indirection layer for system calls
– File operation table set up at file open time
– Points to actual handling code for particular type
– Further file operations redirected to those functions

Data Types in VFS Interface
VFS
- Represent all file system types
- Contains pointers to functions to manipulate each file system as a whole
- Form a standard interface to the file system
Vnode
- Represents a file (inode) in the underlying filesystem
- Points to the real inode
- Contains pointers to functions to manipulate files/inodes like open/close/read
File descriptors
Attributes
- Each open file has a file descriptor
- Read/Write/Lseek/… use them to specify which file to operate on.
State associated with a file descriptor
- File pointer
- Determine where in the file the next read or write is performed
- Mode
- Was the file opened with read-only permission, read-write, etc.
Option 1: a single FD table, with no open file (OF) table
How to achieve:
Use vnode numbers as file descriptors and add a file pointer to the vnode.
Problem:
The file pointer lives in the vnode, so every process using the file shares one file pointer; concurrent or independent access to the same file cannot be handled.

Option 2: a single global open file array
- Entries contain a file pointer and a pointer to a vnode; a file descriptor is an index into this array.
Issues:
File descriptor 1 is stdout -> the console for some processes, but it may be a file for other processes
Entry 1 should be different per process!
Per-process File descriptor array

Each process has its own open file array
Issue
- Fork
- Fork defines that the child shares the file pointer with the parent
- Dup2
- Also defines that the two file descriptors share the file pointer
- With per-process table, we can only have independent file pointers, even when accessing the same file
Per-Process fd table with global open file table

Per-process file descriptor array contains pointers to open file table entry
Open file table array contains entries with a file pointer and a pointer to a vnode.
This model provides
- Shared file pointers if required
- Independent file pointers if required
This model is used by Linux and most other UNIX operating systems
Buffer
–Temporary storage used when transferring data between two entities
•Especially when the entities work at different rates
•Or when the unit of transfer is incompatible
•Example: between application program and disk
Buffer disk blocks
•Allow applications to work with arbitrarily sized region of a file
–However, apps can still optimise for a particular block size
•Writes can return immediately after copying to kernel buffer
–Avoids waiting until write to disk is complete
–Write is scheduled in the background
•Can implement read-ahead by pre-loading next block on disk into kernel buffer
–Avoids having to wait until next read is issued
Cache
Fast storage used to temporarily hold data to speed up repeated access to the data -> such as main memory can cache disk blocks
•On access
–Before loading block from disk, check if it is in cache first
•Avoids disk accesses
•Can optimise for repeated access for single or several processes
Buffer and caching are related
•Data is read into buffer; an extra independent
cache copy would be wasteful
•After use, block should be cached
•Future access may hit cached copy
•Cache utilises unused kernel memory space;
–may have to shrink, depending on memory demand
Cache replacement
When the buffer cache is full and we need to read another block into memory, we must choose an existing entry to replace
Policies:
- FIFO (first in first out)
- Least recently used
Etc.
File system consistency
Generally, cached disk blocks are prioritised in terms of how critical they are to file system consistency.
–Directory blocks, inode blocks if lost can corrupt entire filesystem
•These blocks are usually scheduled for immediate write to disk
–Data blocks if lost corrupt only the file that they are associated with
•These blocks are only scheduled for write back to disk periodically
•In UNIX, flushd (flush daemon) flushes all modified blocks to disk every 30 seconds
Possible Solution
Write-through cache
- All modified blocks are written immediately to disk
But this will generate more disk traffic
This approach was historically needed because:
- Floppies could be removed from drives at any time
- Users were constantly resetting (or crashing) their machines
This method is still used for USB storage.
Lec11 ext2 FS
Descriptions:
• Second Extended Filesystem
– The main Linux FS before ext3
– Evolved from Minix filesystem (via “Extended Filesystem”)
• Features
– Block size (1024, 2048, and 4096) configured at FS creation
– inode-based FS
– Performance optimisations to improve locality (from BSD FFS)
• Main Problem: unclean unmount -> e2fsck
– Ext3fs keeps a journal of (meta-data) updates
– Journal is a file where updates are logged
– Compatible with ext2fs
i-node
• Each file is represented by an inode on disk
• Inode contains the fundamental file metadata
– Access rights, owner, accounting info
– (partial) block index table of a file
• Each inode has a unique number
– System oriented name
• Directories map file names to inode numbers
– Map human-oriented to system-oriented names
Inode contents
- Mode
- Type
- Regular file or directory
- Access mode
- Rwxrwxrwx
- Uid
- User ID
- Gid
- Group ID
- Atime
- Time of last access
- Ctime
- Time when the file was created
- Mtime
- Time when file was modified
- Size
- Offset of the highest byte written
- Block count
- Number of disk blocks used by the file
- Note that the number of blocks can be much less than expected given the file size
IMPORTANT:
Files can be sparsely populated.
E.g. write one byte, lseek() far past it, then write another byte: the blocks in between are never allocated.
So the file size is not the same as the block count.
- Direct Blocks
- Block numbers of the first 12 blocks in the file
- Most files are small
- We can find block of file directly from the inode.
For larger files, the block numbers of data blocks beyond the first 12 are stored in single/double/triple indirect blocks.
- Single indirect
- Requires two disk accesses to read
- One for the indirect block, one for the target block.
- Max file size (direct + single indirect only)
- Assume 1K byte block size, 4 byte block numbers
- 12 * 1K + (1K/4) * 1K = 268 KiB
So for the large majority of files (< 268 KiB), given the inode, only one or two further accesses are required to read any block in the file.
- Double indirect
- Block number of a block containing block numbers of blocks, which in turn contain the block numbers of the data blocks (two levels of indirection)
- Triple indirect
- Block number of a block containing block numbers of double-indirect blocks (three levels of indirection)


Max File Size
Assume 4 bytes block numbers and 1K blocks
• The number of addressable blocks
– Direct Blocks = 12
– Single Indirect Blocks = 256
– Double Indirect Blocks = 256 * 256 = 65536
– Triple Indirect Blocks = 256 * 256 * 256 = 16777216
• Max File Size
12 + 256 + 65536 + 16777216 = 16843020 blocks ≈ 16 GB
Example:
• Assume 4K blocks, 4 byte block numbers, 12 direct blocks
what if we add
– lseek(fd, 5242880, SEEK_SET) /* 5 megabytes */
– write(fd, “x”, 1)

Address = 5242880 ==>
Block number = 5242880 / 4096 = 1280
Direct blocks cover file blocks 0-11; the single indirect block covers blocks 12-1035 (12 + 1024)
Double indirect offset (20-bit) = 1280 - 1036 = 244
Top 10 bits = 0 -> 0th entry of the first-level block
Lower 10 bits = 244 -> 244th entry of the second-level block
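A small C sketch reproducing this calculation: given a byte offset it reports which indirection level holds the block number and the per-level entries (block size, pointer size and the 12 direct blocks match the example above):

```c
#include <stdio.h>

#define BLOCK_SIZE   4096
#define PTRS         (BLOCK_SIZE / 4)   /* 4-byte block numbers -> 1024 per block */
#define NDIRECT      12

int main(void)
{
    long offset = 5242880;              /* the lseek() offset from the example */
    long blk = offset / BLOCK_SIZE;     /* logical file block number: 1280 */

    if (blk < NDIRECT) {
        printf("direct block %ld\n", blk);
    } else if (blk < NDIRECT + PTRS) {
        printf("single indirect, entry %ld\n", blk - NDIRECT);
    } else if (blk < NDIRECT + PTRS + (long)PTRS * PTRS) {
        long rel = blk - NDIRECT - PTRS;            /* 1280 - 1036 = 244 */
        printf("double indirect, level-1 entry %ld, level-2 entry %ld\n",
               rel / PTRS, rel % PTRS);             /* 0 and 244 */
    } else {
        long rel = blk - NDIRECT - PTRS - (long)PTRS * PTRS;
        printf("triple indirect, entries %ld %ld %ld\n",
               rel / (PTRS * PTRS), (rel / PTRS) % PTRS, rel % PTRS);
    }
    return 0;
}
```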
Best and worst case access patterns
Assume inode already in memory
To read 1 byte
– Best:
• 1 access via direct block
– Worst:
• 4 accesses via the triple indirect block
To write 1 byte
– Best:
• 1 write via direct block (with no previous content)
– Worst:
• 4 reads (to get previous contents of block via triple indirect) + 1 write (to write modified block back)
Worst Case Access Patterns with Unallocated Indirect Blocks
• Worst to write 1 byte
– 4 writes (3 indirect blocks; 1 data)
- All indirect are new
– 1 read, 4 writes (read-write 1 indirect, write 2; write 1 data)
- First indirect has something
– 2 reads, 3 writes (read 1 indirect, read-write 1 indirect, write 1; write 1 data)
- First and second indirect have something
– 3 reads, 2 writes (read 2, read-write 1; write 1 data)
- All three indirect have something
• Worst to read 1 byte
– If reading writes a zero-filled block on disk
– Worst case is same as write 1 byte
– If not, worst-case depends on how deep is the current indirect block tree.
Inode Summary
The inode (and indirect blocks) contains the on-disk metadata associated with a file
– Contains mode, owner, and other bookkeeping
– Efficient random and sequential access via indexed allocation
– Small files (the majority of files) require only a single access
– Larger files require progressively more disk accesses for random access
• Sequential access is still efficient
– Can support really large files via increasing levels of indirection
System V Disk Layout (s5fs)

– Boot Block
• contain code to bootstrap the OS
– Super Block
• Contains attributes of the file system itself
• e.g. size, number of inodes, start block of inode array, start of data block area, free inode list, free data block list
– Inode Array
– Data blocks
Problems
• Inodes at start of disk; data blocks end
– Long seek times
• Must read inode before reading data blocks
• Only one superblock
– Corrupt the superblock and entire file system is lost
• Block allocation was suboptimal
– Consecutive free block list created at FS format time
• Allocation and de-allocation eventually randomises the list resulting in random allocation
• Inode free list also randomised over time
– Directory listing resulted in random inode access pattern
Berkeley Fast Filesystem (FFS)
•Historically followed s5fs
–Addressed many limitations with s5fs
–ext2fs mostly similar
Layout of an Ext2 FS

•Partition:
–Reserved boot block,
–Collection of equally sized block groups
–All block groups have the same structure
Layout of a block group

•Replicated super block
–For e2fsck
•Replicated Group descriptors
•Bitmaps identify used inodes/blocks
•All block groups have the same number of data blocks
•Advantages of this structure:
–Replication simplifies recovery
–Proximity of inode tables and data blocks (reduces seek time)
Superblocks
•Size of the file system, block size and similar parameters
•Overall free inode and block counters
•Data indicating whether file system check is needed:
–Uncleanly unmounted
–Inconsistency
–Certain number of mounts since last check
–Certain time expired since last check
•Replicated to provide redundancy to aid recoverability
Group Descriptors
•Location of the bitmaps
•Counter for free blocks and inodes in this group
•Number of directories in the group
•Replicated to provide redundancy to aid recoverability
This information helps the file system locate and manage the resources of each block group.
Performance considerations
EXT2 optimisations
– Block groups cluster related inodes and data blocks
–Pre-allocation of blocks on write (up to 8 blocks)
•8 bits in bit tables
•Better contiguity when there are concurrent writes
–Aim to store files within a directory in the same group
Ext2 Directories
•Directories are files of a special type
•Consider it a file of special format, managed by the kernel, that uses most of the same machinery to implement it
–Inodes, etc…
•Directories translate names to inode numbers
•Directory entries are of variable length
•Entries can be deleted in place by
•setting the entry's inode number to 0, or
•adding the deleted entry's length to the record length of the previous entry


“f1” = inode 7
“file2” = inode 43
“f3” = inode 85
Hard Links
Inodes can have more than one name -> hard link

Reference Count in inode
Keep a reference count in the inode, so:
When we delete a file by name, i.e. remove the directory entry (link), the file system knows when to delete the underlying inode
•Adding a name (directory entry) increments the count
•Removing a name decrements the count
•If the reference count == 0, then we have no names for the inode (it is unreachable), we can delete the inode (underlying file or directory)
Symbolic links
• A symbolic link is a file that contains a reference to another file or directory
– Has its own inode and data block, which contains a path to the target file
– Marked by a special file attribute
– Transparent for some operations
– Can point across FS boundaries -> a symbolic link only stores the path to the target file, not a reference to its inode.
File system reliability
• Disk writes are buffered in RAM
– OS crash or power outage ==> lost data
– Commit writes to disk periodically (e.g., every 30 sec)
– Use the sync command to force a FS flush
• FS operations are non-atomic
– Incomplete transaction can leave the FS in an inconsistent state
Other examples
• e2fsck
– Scans the disk after an unclean shutdown and attempts to restore FS invariants
• Journaling file systems
– Keep a journal of FS updates
– Before performing an atomic update sequence,
– write it to the journal
– Replay the last journal entries upon an unclean shutdown
– Example: ext3fs
Lec12 ext3 FS
Design goals
– Add journaling capability to the ext2 FS
– Backward and forward compatibility with ext2
• Existing ext2 partitions can be mounted as ext3
– Leverage the proven ext2 performance
– Reuse most of the ext2 code base
– Reuse ext2 tools, including e2fsck
The ext3 journal
Option 1: Journal FS data structure updates
Example:
– Start transaction
– Delete dir entry
– Delete i-node
– Release blocks 32, 17, 60
– End transaction
Conclusion
✔Efficient use of journal space; hence faster journaling
✘ Individual updates are applied separately
✘ The journaling layer must understand FS semantics
Option 2: Journal disk block updates
Example:
– Start transaction
– Update block #n1 (contains the dir entry)
– Update block #n2 (i-node allocation bitmap)
– Update block #n3 (data block allocation bitmap)
– End transaction
Conclusion
✗ Even a small update adds a whole block to the journal
✔ Multiple updates to the same block can be aggregated into a single update
✔ The journaling layer is FS independent (easier to implement)
Ext3 uses this method
Journaling Block Device (JBD)

JBD interface
– Start a new transaction
– Update a disk block as part of a transaction
– Complete a transaction
• Completed transactions are buffered in RAM
– Commit: write transaction data to the journal (persistent storage)
• Multiple FS transactions are committed in one go
– Checkpoint: flush the journal to the disk
• Used when the journal is full or the FS is being unmounted

Journaling modes
Ext3 supports two journaling modes
– Metadata+data
• Enforces atomicity of all FS operations
– Metadata journaling
• Metadata is journalled
• Data blocks are written directly to the disk
• Improves performance
• Enforces file system integrity
• Does not enforce atomicity of writes
– New file content can be stale blocks
JBD conclusion
JBD
• JBD can keep the journal on a block device or in a file
– Enables compatibility with ext2 (the journal is just a normal file)
• JBD is independent of ext3-specific data structures
– Separation of concerns
• The FS maintains on-disk data and metadata
• JBD takes care of journaling
– Code reuse
• JBD can be used by any other FS that requires journaling
Lec13 memory Management
Fixed Partition Summary
• Simple
• Easy to implement
•Can result in poor memory utilisation
• Due to internal fragmentation
•Used on IBM System 360 operating system (OS/MFT)
• Announced 6 April, 1964
• Still applicable for simple embedded systems
• Static workload known in advance
Dynamic Partitioning
Allocation
•Also applicable to malloc()-like in-application allocators
•Given a region of memory, basic requirements are:
• Quickly locate a free partition satisfying the request
• Minimise CPU time search
• Minimise external fragmentation
• Minimise memory overhead of bookkeeping
• Efficiently support merging two adjacent free partitions into a larger partition
Classic Approach
Represent available memory as a linked list of available “holes” (free memory ranges).
• Base, size
• Kept in order of increasing address
• Simplifies merging of adjacent holes into larger holes.
• List nodes are stored in the “holes” themselves
Placement
First-fit algorithm
• Scan the list for the first entry that fits
• If greater in size, break it into an allocated and free part
• Intent: Minimise amount of searching performed
• Aims to find a match quickly
• Biases allocation to one end of memory
• Tends to preserve larger blocks at the end of memory
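A minimal sketch of first-fit over a hole list kept in address order (the list node layout and the omitted merge/free logic are assumptions for illustration):

```c
#include <stddef.h>

/* A hole in the free list, kept in order of increasing base address. */
struct hole {
    size_t base;
    size_t size;
    struct hole *next;
};

static struct hole *free_list;           /* assumed to be initialised elsewhere */

/* First fit: take the first hole large enough, splitting off the remainder. */
long alloc_first_fit(size_t request)
{
    for (struct hole *h = free_list; h != NULL; h = h->next) {
        if (h->size >= request) {
            size_t base = h->base;
            h->base += request;          /* shrink the hole from the front */
            h->size -= request;
            /* (A real allocator would also unlink holes that become empty
               and record the allocation so it can later be freed/merged.) */
            return (long)base;
        }
    }
    return -1;                           /* no hole big enough */
}
```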
Next-fit
Like first-fit, except it begins its search from the point in list where the last request succeeded instead of at the beginning.
• (Flawed) Intuition: spread allocation more uniformly over entire memory to avoid skipping over small holes at start of memory
• Performs worse than first-fit as it breaks up the large free space at
end of memory
Best-fit
• Chooses the block that is closest in size to the request
• Performs worse than first-fit
• Has to search complete list
• does more work than first-fit
• Since smallest block is chosen for a process, the smallest amount of external fragmentation is left
• Creates lots of unusable holes
Worst-fit
• Chooses the block that is largest in size (worst-fit)
• (whimsical) idea is to leave a usable fragment left over
• Poor performer
• Has to do more work (like best fit) to search complete list
• Does not result in significantly less fragmentation
Summary
First-fit generally better than the others and easiest to implement
They are simple solutions to a still-existing OS or application service/function: memory allocation.
Largely have been superseded by more complex and specific allocation strategies
Compaction
We can reduce external fragmentation by compaction
• Shuffle memory contents to place all free memory together in one large block.
• Only if we can relocate running programs?
• Pointers?
• Generally requires hardware support
Issues
• Relocation
• How does a process run in different locations in memory?
• Protection
• How do we prevent processes interfering with each other
When are memory address bound?
• Compile/link time
• Compiler/Linker binds the addresses
• Must know “run” location at compile time
• Recompile if location changes
• Load time
• Compiler generates relocatable code
• Loader binds the addresses at load time
• Run time
• Logical compile-time addresses translated to physical addresses by special hardware.

Hardware Support for Runtime Binding and Protection
• For process to run using logical addresses
• Process expects to access addresses from zero to some limit of memory size
• Need to add an appropriate offset to its logical addresses
• Achieve relocation
• Protect memory “lower” than the process
• Must limit the maximum logical address the process can generate
• Protect memory “higher” than the process
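A conceptual sketch of the check the base/limit hardware performs on every memory reference (raise_protection_fault() is a stand-in for delivering the exception):

```c
void raise_protection_fault(void);   /* assumed: delivers the exception to the OS */

/* Conceptual model of the hardware check, not code the OS itself runs. */
unsigned long translate(unsigned long vaddr,
                        unsigned long base, unsigned long limit)
{
    if (vaddr >= limit) {
        /* logical address outside the process's region: raise a
           protection exception, typically killing the process */
        raise_protection_fault();
    }
    return base + vaddr;              /* relocate: physical = base + logical */
}
```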

Base and Limit Registers
•Also called
• Base and bound registers
• Relocation and limit registers
•Base and limit registers
• Restrict and relocate the currently active process
• Base and limit registers must be changed at
• Load time
• Relocation (compaction time)
• On a context switch
Pros
• Supports protected multi-processing (-tasking)
Cons
• Physical memory allocation must still be contiguous
• The entire process must be in memory
• Do not support partial sharing of address spaces
• No shared code, libraries, or data structures between processes
Swapping
• A process can be swapped temporarily out of memory to a backing store, and then brought back into memory for continued execution.
• Swapping involves transferring the whole process
• Backing store – fast disk large enough to accommodate copies of all memory images for all users; must provide direct access to these memory images.
• Can prioritize – lower-priority process is swapped out so higher-priority process can be loaded and executed.
• Major part of swap time is transfer time; total transfer time is directly proportional to the amount of memory swapped.
• slow

Virtual memory-paging
• Partition physical memory into small equal sized chunks
• Called frames
• Divide each process’s virtual (logical) address space into same size chunks
• Called pages
• Virtual memory addresses consist of a page number and offset within the page
• OS maintains a page table
• contains the frame location for each page
• Used by hardware to translate each virtual address to physical address
• The relation between virtual addresses and physical memory addresses is given by the page table (a translation sketch follows after this list)
• Process’s physical memory does not have to be contiguous
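A minimal sketch of virtual-to-physical translation through a single-level page table, assuming 4 KB pages and a small illustrative address space (real page tables are multi-level and the lookup is done by the hardware/TLB).

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS 12u                  /* assume 4 KB pages */
#define PAGE_SIZE (1u << PAGE_BITS)
#define NUM_PAGES 1024u                /* small illustrative address space */

/* One page-table entry: which frame holds the page, and whether the
 * mapping is currently valid (resident). */
struct pte {
    uint32_t frame;
    bool     valid;
};

/* Split the virtual address into (page number, offset), look the page up,
 * and recombine the offset with the frame number.  An invalid entry is
 * what triggers a page fault on a real machine. */
bool vaddr_to_paddr(const struct pte table[NUM_PAGES],
                    uint32_t vaddr, uint32_t *paddr)
{
    uint32_t page   = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);

    if (page >= NUM_PAGES || !table[page].valid) {
        return false;                  /* -> page fault */
    }
    *paddr = (table[page].frame << PAGE_BITS) | offset;
    return true;
}
```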
Paging
• No external fragmentation
• Small internal fragmentation (in last page)
• Allows sharing by mapping several pages to the same frame
• Abstracts physical organisation
• Programmers only deal with virtual addresses
• Minimal support for logical organisation
• Each unit is one or more pages
Memory Management Unit (Translation Look-aside Buffer/TLB)

Lec14 Virtual memory
Page-based VM
Virtual Memory
– Divided into equal sized pages
– A mapping is a translation between
• A page and a frame
• A page and invalid
– Mappings defined at runtime
• They can change
– Address space can have holes
– Process does not have to be contiguous in physical memory
Typical Address space Layout
• Stack region is at top, and can grow down
• Heap has free space to grow up
• Text is typically read-only
• Kernel is in a reserved, protected, shared region
• 0-th page typically not used, why? -> NULL pointer
• A process may be only partially resident
– Allows OS to store individual pages on disk
– Saves memory for infrequently used data & code
• What happens if we access non-resident memory? -> Page fault
Page fault
• Referencing an invalid page triggers a page fault
• An exception handled by the OS
• Broadly, two standard page fault types
– Illegal Address (protection error)
• Signal or kill the process
– Page not resident (a toy sketch of this path follows after this list)
• Get an empty frame
• Load the page from disk
• Update the page (translation) table (enter frame #, set valid bit, etc.)
• Restart the faulting instruction
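A toy sketch of the "page not resident" path above. Everything here is a stand-in: "RAM" and the backing store are just arrays, there is no replacement policy, and the illegal-address case simply returns an error instead of signalling the process.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  4096u
#define NUM_PAGES  8u                  /* tiny illustrative address space */
#define NUM_FRAMES 4u

struct pte { unsigned frame; bool valid; };

static struct pte page_table[NUM_PAGES];
static uint8_t    ram[NUM_FRAMES][PAGE_SIZE];   /* "physical memory" */
static uint8_t    disk[NUM_PAGES][PAGE_SIZE];   /* "backing store"   */
static unsigned   next_free_frame;              /* no eviction here  */

/* Not-resident fault path: get an empty frame, load the page from the
 * backing store, update the page table; the caller then restarts the
 * faulting instruction.  Returns 0 on success. */
int handle_page_fault(unsigned page)
{
    if (page >= NUM_PAGES) {
        return -1;                              /* illegal address        */
    }
    if (next_free_frame >= NUM_FRAMES) {
        return -1;                              /* would need replacement */
    }
    unsigned frame = next_free_frame++;         /* 1. get an empty frame  */
    memcpy(ram[frame], disk[page], PAGE_SIZE);  /* 2. load page from disk */
    page_table[page].frame = frame;             /* 3. update page table   */
    page_table[page].valid = true;
    return 0;                                   /* 4. restart instruction */
}
```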
Shared Pages
Private code and data
– Each process has own copy of code and data
– Code and data can appear anywhere in the address space
Shared code
– Single copy of code shared between all processes executing it
– Code must not be self-modifying
– Code must appear at same address in all processes
Thrashing
• CPU utilisation tends to increase with the degree of multiprogramming
– number of processes in system
• Higher degrees of multiprogramming – less memory available per process
• Some process’s working sets may no longer fit in RAM
– Implies an increasing page fault rate
• Eventually many processes have insufficient memory
– Can’t always find a runnable process
– Decreasing CPU utilisation
– System becomes I/O limited
- This is thrashing

Recovery From Thrashing
In the presence of increasing page fault frequency and decreasing CPU utilisation
– Suspend a few processes to reduce degree of multiprogramming
– Resident pages of suspended processes will migrate to backing store
– More physical memory becomes available
• Less faults, faster progress for runnable processes
– Resume suspended processes later when memory pressure eases
Page Size
Increasing page size:
• Increases internal fragmentation
– Reduces adaptability to working set size
• Decreases number of pages
– Reduces size of page tables
• Increases TLB coverage
– Reduces number of TLB misses
• Increases page fault latency
– Need to read more from disk before restarting the process
• Increases swapping I/O throughput
– Small I/Os are dominated by seek/rotation delays
Optimal page size is a (workload-dependent) trade-off (a worked example follows)
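A quick worked example with illustrative numbers: for a 32-bit virtual address space, 4 KB pages give 2^32 / 2^12 = 2^20 (about one million) page-table entries per process, with on average half a page (~2 KB) of internal fragmentation in the last page of each region. Moving to 64 KB pages cuts the table to 2^16 = 65,536 entries and lets each TLB entry cover 16 times as much memory, but the average internal fragmentation per region grows to ~32 KB and each page fault must transfer 16 times as much data from disk.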
Lec16 Multiprocessor Systems
Amdahl’s law
• Given a proportion P of a program that can be made parallel, and the remaining serial portion (1 − P), the speedup from using N processors is:
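In standard form (speedup relative to the single-processor run time):

Speedup(N) = 1 / ((1 − P) + P/N), which approaches 1 / (1 − P) as N → ∞.

For example, with P = 0.9 even an unlimited number of processors gives at most a 10× speedup.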

Bus-based uniform memory access multiprocessors
The simplest MP is more than one processor on a single bus connected to memory
• Access to all memory occurs at the same speed for all processors.
• Bus bandwidth becomes a bottleneck with more than just a few CPUs

Multiprocessor caches
Each processor has a cache to reduce its need for access to memory
• Hope is most accesses are to the local cache
• Bus bandwidth still becomes a bottleneck with many CPUs

Cache consistency
Cache consistency is usually handled by the hardware.
• Writes to one cache propagate to, or invalidate, the appropriate entries in other caches
• Cache transactions also consume bus bandwidth
Multi-core Processor

Summary
With only a single shared bus, scalability can be limited by the bus bandwidth of the single bus
• Caching only helps so much
• Multiprocessors can
• Increase computation power beyond that available from a single CPU
• Share resources such as disk and memory
• However
• Assumes parallelizable workload to be effective
• Assumes not I/O bound
• Shared buses (bus bandwidth) limits scalability
• Can be reduced via hardware design
• Can be reduced by carefully crafted software behaviour
• Good cache locality together with limited data sharing where possible
Each CPU has its own OS
• Statically allocate physical memory to each CPU
• Each CPU runs its own independent OS
• Share peripherals
• Each CPU (OS) handles its processes' system calls

• Used in early multiprocessor systems to ‘get them going’
• Simpler to implement
• Avoids CPU-based concurrency issues by not sharing
• Scales – no shared serial sections
• Modern analogy, virtualisation in the cloud
Issues
• Each processor has its own scheduling queue
• We can have one processor overloaded, and the rest idle
• Each processor has its own memory partition
• We can have one processor thrashing while the others have free memory
• No way to move free memory from one OS to another
Symmetric Multiprocessors (SMP)

• OS kernel run on all processors
• Load and resources are balanced between all processors
• Including kernel execution
• Issue: Real concurrency in the kernel
• Need carefully applied synchronisation primitives to avoid disaster
• One alternative: a single mutex that makes the entire kernel one large critical section
• Only one CPU can be in the kernel at a time
• The “big lock” becomes a bottleneck when in-kernel processing exceeds what can be done on a single CPU
• Better alternative: identify largely independent parts of the kernel and make each of them their own critical section
• Allows more parallelism in the kernel
• Issue: Difficult task
• Code is mostly similar to uniprocessor code
• Hard part is identifying independent parts that don’t interfere with each other
• Remember all the inter-dependencies between OS subsystems.
Multiprocessor synchronisation
Given we need synchronisation, how can we achieve it on a multiprocessor machine?
• Unlike a uniprocessor, disabling interrupts does not work.
• It does not prevent other CPUs from running in parallel
• Need special hardware support
Test and set
Hardware guarantees that the instruction executes atomically on a CPU.
• Atomically: As an indivisible unit.
• The instruction can not stop half way through
• On a multiprocessor, it does not work without some extra hardware support
• A solution:
• Hardware blocks all other CPUs from accessing the bus during the TSL instruction to prevent memory accesses by any other CPU.
• TSL has mutually exclusive access to memory for duration of instruction
BUT: Test-and-Set is a busy-wait synchronisation primitive -> a spinlock
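A minimal spinlock sketch built on an atomic test-and-set. The GCC/Clang builtin `__atomic_test_and_set` is used here purely as a stand-in for a hardware TSL instruction (assumption: a GCC-compatible compiler).

```c
#include <stdbool.h>

/* Spinlock built on test-and-set: the flag is true while the lock is held. */
typedef struct {
    bool locked;
} spinlock_t;

void spinlock_acquire(spinlock_t *l)
{
    /* Atomically set the flag and get its previous value; keep retrying
     * (spinning, and hammering the bus/cache line) until the previous
     * value was "unlocked". */
    while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE)) {
        /* busy-wait */
    }
}

void spinlock_release(spinlock_t *l)
{
    __atomic_clear(&l->locked, __ATOMIC_RELEASE);
}
```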
• Issue:
• Lock contention leads to spinning on the lock
• Spinning on a lock requires blocking the bus which slows all other CPUs down
• Independent of whether other CPUs need a lock or not
• Causes bus contention
Caching does not help reduce bus contention
• Either TSL still blocks the bus
• Or TSL requires exclusive access to an entry in the local cache
• Requires invalidation of same entry in other caches, and loading entry into local cache
• Many CPUs performing TSL simply bounce a single exclusive entry between all caches using the bus
Reducing Bus contention
• Read before TSL
• Spin reading the lock variable waiting for it to change
• When it does, use TSL to acquire the lock
• Allows the lock to be shared read-only in all caches until it is released
• No bus traffic until the actual release (see the sketch after this list)
• No race conditions, as acquisition is still with TSL.
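The same lock acquisition rewritten as "read before TSL" (test-and-test-and-set), reusing the `spinlock_t` from the sketch above: spin on an ordinary read, which stays in the local cache, and only attempt the atomic test-and-set once the lock looks free.

```c
/* Read-before-TSL ("test-and-test-and-set") acquire, using the same
 * spinlock_t as the previous sketch. */
void spinlock_acquire_ttas(spinlock_t *l)
{
    for (;;) {
        /* Read-only spin: the lock value is shared in all caches, so
         * there is no bus traffic until the holder releases it. */
        while (__atomic_load_n(&l->locked, __ATOMIC_RELAXED)) {
            /* busy-wait */
        }
        /* The lock looked free: try to grab it atomically.  If another
         * CPU got there first, fall back to the read-only spin. */
        if (!__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE)) {
            return;
        }
    }
}
```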
Benchmark

• Test-and-set performs poorly once there are enough CPUs to cause contention for the lock
• Expected
• Read before Test and Set performs better
• Performance less than expected
• Still significant contention on lock when CPUs notice release and all attempt acquisition
• Critical section performance degenerates
• Critical section requires bus traffic to modify shared structure
• The lock holder competes for the bus with the waiting CPUs as they test-and-set, so the lock holder is slowed down
• Slower lock holder results in more contention
Spinning vs blocking and switching
Uniprocessor
• Spinning (busy-waiting) on a lock makes no sense on a uniprocessor
• There is no other running process that could release the lock
• Blocking and (eventually) switching to the lock holder is the only sensible option.
• On SMP systems, the decision to spin or block is not as clear.
• The lock is held by another running CPU and will be freed without necessarily switching away from the requestor


Multiprocessor
• Blocking and switching to another process takes time
• Save context and restore another
• Cache contains the current process, not the new process
• Adjusting the cache working set also takes time
• TLB is similar to the cache
• Switching back when the lock is free encounters the same costs again
• Spinning wastes CPU time directly


Trade off
• If lock is held for less time than the overhead of switching to and back
⇒ It's more efficient to spin
Spinlocks expect critical sections to be short
No waiting for I/O within a spinlock
No nesting locks within a spinlock
Preemption and spinlocks
• Critical sections synchronised via spinlocks are expected to be short
• Avoid other CPUs wasting cycles spinning
• What happens if the spinlock holder is preempted at the end of its timeslice?
• Mutual exclusion is still guaranteed
• Other CPUs will spin until the holder is scheduled again!
Spinlock implementations disable interrupts in addition to acquiring locks
• avoids lock-holder preemption
• avoids spinning on a uniprocessor
Lec17 scheduling
Basic
The scheduler decides who to run next.
– The process of choosing is called scheduling.
Scheduling decisions can have a dramatic effect on the perceived performance of the system
– Can also affect correctness of a system with deadlines
Application Behaviour

- CPU-bound process
– Spends most of its time computing
– Time to completion largely determined by received CPU time
- I/O-bound process
– Spends most of its time waiting for I/O to complete
• Small bursts of CPU to process I/O and request the next I/O
– Time to completion largely determined by I/O request time
Observation
• We need a mix of CPU-bound and I/O-bound processes
to keep both CPU and I/O systems busy
• A process can go from CPU-bound to I/O-bound (or vice versa) in different phases of execution
Insight
• Choosing to run an I/O-bound process delays a CPU-bound process by very little
• Choosing to run a CPU-bound process prior to an I/O-bound process delays the next I/O request significantly
– No overlap of I/O waiting with computation
– Results in device (disk) not as busy as possible
⇒ Generally, favour I/O-bound processes over CPU-bound processes
When is scheduling performed
when a process (or thread) can no longer continue, or when an activity results in more than one ready process.
Preemptive versus Non-preemptive Scheduling
Non-preemptive
– Once a thread is in the running state, it continues until it completes, blocks on I/O, or voluntarily yields the CPU
– A single process can monopolise the entire system
Preemptive
– Current thread can be interrupted by OS and moved to ready state.
– Usually after a timer interrupt and process has exceeded its maximum run time
• Can also be the result of a higher-priority process becoming ready (after an I/O interrupt)
– Ensures fairer service, as a single thread can't monopolise the system
• Requires a timer interrupt
Categories of scheduling algorithms
• The choice of scheduling algorithm depends on the goals of the application (or the operating system)
– No one algorithm suits all environments
• We can roughly categorise scheduling algorithms as follows
– Batch Systems
• No users directly waiting, can optimise for overall machine performance
– Interactive Systems
• Users directly waiting for their results, can optimise for users' perceived performance
– Realtime Systems
• Jobs have deadlines, must schedule such that all jobs (predictably) meet their deadlines.
Goals of scheduling Algorithms
All algorithms
– Fairness
• Give each process a fair share of the CPU
– Policy Enforcement
• Whatever policy is chosen, the scheduler should ensure it is carried out
– Balance/Efficiency
• Try to keep all parts of the system busy
Interactive Algorithms
– Minimise response time
• Response time is the time difference between issuing a command and getting the result
– E.g. selecting a menu item and getting the result of that selection
• Response time is important to the user’s perception of the performance of the system.
– Provide Proportionality
• Proportionality is the user expectation that short jobs will have a short response time, and long jobs can have a long response time.
• Generally, favour short jobs
Real-time Algorithms
– Must meet deadlines
• Each job/task has a deadline.
• A missed deadline can result in data loss or catastrophic failure
– Aircraft control system missed deadline to apply brakes
– Provide Predictability
• For some apps, an occasional missed deadline is okay
– E.g. DVD decoder
• Predictable behaviour allows smooth DVD playback
Round Robin Scheduling
• Each process is given a timeslice to run in
• When the timeslice expires, the next process preempts the current process, and runs for its timeslice, and so on
– The preempted process is placed at the end of the queue
• Implemented with
– A ready queue
– A regular timer interrupt (a toy dispatcher sketch follows at the end of this section)
Pros
– Fair, easy to implement
Cons
– Assumes everybody is equal
Issue: how long should the timeslice be
– Too short
• Waste a lot of time switching between processes
• Example: timeslice of 4ms with 1ms context switch = 20% round robin overhead
– Too long
• System is not responsive
• Example: timeslice of 100ms
– If 10 people hit “enter” key simultaneously, the last guy to run will only see progress after 1 second.
• Degenerates into FCFS if timeslice longer than burst length
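A toy round-robin dispatcher (illustrative only: "processes" are just integer IDs in a circular ready queue, and the timer interrupt is simulated by a plain function call at the end of each timeslice).

```c
#include <stdio.h>

#define MAX_READY 8

/* Toy circular ready queue of process IDs. */
static int ready[MAX_READY];
static int head, count;

static void enqueue(int pid) { ready[(head + count++) % MAX_READY] = pid; }
static int  dequeue(void)    { int p = ready[head]; head = (head + 1) % MAX_READY; count--; return p; }

/* Called on the timer interrupt when the timeslice expires: the preempted
 * process goes to the tail of the queue, the process at the head runs next. */
static int timeslice_expired(int current)
{
    enqueue(current);
    return dequeue();
}

int main(void)
{
    enqueue(2); enqueue(3);                 /* two processes already ready */
    int running = 1;
    for (int tick = 0; tick < 6; tick++) {  /* runs 1, 2, 3, 1, 2, 3       */
        printf("timeslice %d: running pid %d\n", tick, running);
        running = timeslice_expired(running);
    }
    return 0;
}
```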
Priorities
• Each Process (or thread) is associated with a priority
• Provides basic mechanism to influence a scheduler decision:
– The scheduler will always choose a thread of higher priority over one of lower priority
• Priorities can be defined internally or externally
– Internal: e.g. I/O bound or CPU bound
– External: e.g. based on importance to the user
Priority-driven preemptively scheduled

Usually implemented by multiple priority queues, with round robin on each queue
Con:
– Low priorities can starve
• Need to adapt priorities periodically
– Based on ageing or execution history
Traditional UNIX scheduler
• Two-level scheduler:
– High-level scheduler schedules processes between memory and disk
– Low-level scheduler is CPU scheduler
• Based on a multi-level queue structure with round robin at each level
• The highest priority (lowest number) is scheduled first
• Priorities are re-calculated once per second, and re-inserted in appropriate queue
– Avoid starvation of low priority threads
– Penalise CPU-bound threads
• Priority = CPU_usage + nice + base (a recalculation sketch follows after this list)
– CPU_usage = number of clock ticks
• Decays over time to avoid permanently penalising the process
– Nice is a value given to the process by a user to permanently boost or reduce its priority
• Reduce priority of background jobs
– Base is a set of hardwired, negative values used to boost priority of I/O bound system activities
• Swapper, disk I/O, Character I/O
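A sketch of the once-per-second recalculation described above. The field names and the simple halving decay are illustrative (loosely in the spirit of the classic UNIX/BSD schedulers), not the exact formula of any particular kernel.

```c
/* Illustrative per-process scheduling state for the traditional UNIX scheduler. */
struct proc_sched {
    int cpu_usage;   /* recent clock ticks charged to the process              */
    int nice;        /* user-supplied bias: positive values lower the priority */
    int base;        /* hardwired value (negative for I/O-bound kernel work)   */
    int priority;    /* smaller number = higher priority                       */
};

/* Run once per second for every process: decay past CPU usage so the
 * penalty fades over time, then recompute the priority.  The process is
 * then re-inserted into the queue for its new priority. */
void recompute_priority(struct proc_sched *p)
{
    p->cpu_usage /= 2;                               /* decay old CPU usage */
    p->priority   = p->cpu_usage + p->nice + p->base;
}
```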

Multiprocessor scheduling
A single shared ready queue
When a CPU goes idle, it takes the highest priority process from the shared ready queue
Pros
– Simple
– Automatic load balancing
Cons
– Lock contention on the ready queue can be a major bottleneck
• Due to frequent scheduling or many CPUs or both
– Not all CPUs are equal
• The last CPU a process ran on is likely to still hold related entries in its cache -> running the process elsewhere wastes time re-loading the cache
Affinity scheduling
Basic Idea
– Try hard to run a process on the CPU it ran on last time
• One approach: Multiple Queue Multiprocessor Scheduling
Multiple Queue SMP Scheduling
• Each CPU has its own ready queue
• Coarse-grained algorithm assigns processes to CPUs
– Defines their affinity, and roughly balances the load
• The bottom-level fine-grained scheduler:
– Is the frequently invoked scheduler (e.g. on blocking on I/O, a lock, or exhausting a timeslice)
– Runs on each CPU and selects from its own ready queue
• Ensures affinity
– If nothing is available from the local ready queue, it runs a process from another CPU's ready queue rather than going idle
• Termed "work stealing"
Pros
No lock contention on per-CPU ready queues in the (hopefully) common case
– Load balancing to avoid idle queues
– Automatic affinity to a single CPU for more cache friendly behaviour
Multiple Queue SMP Scheduling is a process-scheduling strategy for multiprocessor (SMP) systems in which each processor has its own independent set of ready queues; by assigning processes to different processors it aims for better load balance and parallelism.
Main characteristics:
- Independent per-processor queues: each processor schedules and dispatches from the processes waiting in its own queues.
- Load balancing: when one processor's queues hold noticeably more processes than the others', some processes may be migrated to other processors' queues.
- Parallelism: running processes on different processors at the same time improves overall performance and throughput.
- Locality: because each processor has its own queues, a process tends to keep running on the same processor, improving cache hit rates and performance.
Cons:
- Load balancing is hard in practice: one or more processors can end up with long queues while others sit idle, wasting resources and hurting performance.
- Synchronisation and communication overhead: frequent process migration for balancing adds synchronisation and communication costs, which become significant with many processors.
- Priority management is harder: with a separate queue per processor, managing priorities across queues is more complex than with a single global queue.
- Cache-consistency cost: migrating a process to another processor means re-filling its cache, which weakens the locality advantage if migration is frequent.
- Scalability: with very many processors, effective load balancing, priority management and synchronisation all become harder.
Despite these drawbacks, multiple-queue SMP scheduling remains an effective strategy for multiprocessor systems; alternatives such as global-queue scheduling and cluster scheduling make different trade-offs. (A toy work-stealing sketch follows below.)
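A minimal sketch of per-CPU ready queues with work stealing as described above. The queue representation, the victim-selection order, and the absence of per-queue locking are all simplifications.

```c
#define NCPUS     4
#define QUEUE_LEN 16

/* One toy FIFO ready queue of process IDs per CPU (locking omitted). */
struct ready_queue {
    int pids[QUEUE_LEN];
    int head, count;
};

static struct ready_queue runq[NCPUS];

static int dequeue(struct ready_queue *q)
{
    if (q->count == 0) {
        return -1;                        /* queue empty */
    }
    int pid = q->pids[q->head];
    q->head = (q->head + 1) % QUEUE_LEN;
    q->count--;
    return pid;
}

/* Fine-grained scheduler for one CPU: prefer the local queue (affinity,
 * no contention in the common case); if it is empty, steal work from
 * another CPU's queue rather than going idle. */
int pick_next(int cpu)
{
    int pid = dequeue(&runq[cpu]);
    if (pid != -1) {
        return pid;
    }
    for (int victim = 0; victim < NCPUS; victim++) {
        if (victim == cpu) {
            continue;
        }
        pid = dequeue(&runq[victim]);     /* work stealing */
        if (pid != -1) {
            return pid;
        }
    }
    return -1;                            /* nothing runnable anywhere */
}
```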
NOTE: these notes do not cover part of Lec 14 and most of Lec 15.