Overview

Virtual memory is the idea of creating a logical space of memory locations and backing that logical space with real, physical memory, according to a set of dynamically maintained mapping tables.

There are many benefits to virtual memory.

How physical memory acts

If computation is fire, memory is oxygen. Not only is it required, it is required in quantity. Like a furnace, fast computation requires that memory be force-fed into the computation. Computers differ in the bandwidth between the memory and the processing unit.

Typically the memory is some distance away from the processor. There is a bus that accesses memory, and the CPU bridges onto this bus. In Intel motherboards, this is called the northbridge; it is held close to the CPU, closer than the southbridge, but it is still some distance away.

The contents of memory are moved to and from the CPU in bucketloads, known as cache lines. The AMD K8 has 64-byte cache lines, so memory is handled 512 bits at a time. Prefetching is used on the instructions so that they are brought in before they are needed. Instructions that jump can ruin this prefetch strategy, and the compiler takes this into consideration when compiling the code.

How virtual memory acts

Each program wants to see memory as a sequence of consecutively numbered bytes, uninterrupted by either allocations to other programs or to the operating system. In this memory is placed both instructions and data. The data can have several flavors: it could be locations for global variables, a fixed set of locations available throughout the run of the computation; it could be dynamic, either temporary locations available only for the run of a function, or allocated and referred to by use of a pointer, carefully passed around to interested parties which need to reference the data.

It might use the same addresses for different data than other programs do. Usually all programs have the same start address. At this address, the same virtual address for every program, is different data. The address space seen by the program is only virtual. The virtual space is provisioned with memory taken from the physical space of real memory. The memory management unit (MMU) makes this possible. The MMU is our hero.

The MMU maps pieces of virtual space to equal sized pieces of physical space. In this way, when the CPU goes to read from or write to a memory location, it indicates the location in virtual space numbering, a virtual address. The MMU is hardware that intervenes and converts that address to a physical location in real memory, the physical address.

How the operating system sees memory

Typically, the linker-loader lays out the program and its data into memory. A typical layout is: empty space, then the program in a read-only text area, then the static and pre-initialized data. Above this is the heap, the area for dynamically allocated data, which the C program would get using malloc. The C++ program can use new to create data structures (objects) in the heap. The heap is not of fixed size but grows upwards to the top of memory.

The top of memory is not the exhaustion of physical memory, but the exhaustion of virtual memory. Since virtual memory is virtual, you never run out of substance; you run out of addresses. A 32-bit computer has enough addresses for 4 billion bytes of memory. Beyond this, you are no longer using "memory" in the sense of this section. It could be a file in a file system; those can be larger than 4G. It could be a stream from the network. But it cannot be placed into memory, regardless of your physical memory, because virtual memory is exhausted.

The heap, in fact, cannot go up to the top of memory because the stack is at the top of memory, and it is growing down. Virtual memory is exhausted when the combined demands of the heap and stack use up all of the memory addresses in between. The stack is really where most of the data is stored. All local variables are there; practically everything you do not malloc and have not made global (file scope) will be on the stack.

There is another reason that the heap might not go up to the top of memory. Some operating systems truncate the full virtual space allowed to user programs, and place the operating system at the top of memory. Linux does this. Only 3G of virtual space is allowed the user; between 3G and 4G resides the kernel. This is done so that the kernel is part of every user program. When the user runs, the upper 1G is forbidden for read or write, but when the kernel runs, the full 4G is available, so that the kernel can read or write both its own data and that of the user program.

Not all operating systems do this. OS X has completely distinct spaces for the operating system and user programs.

Pages and page frames

The MMU is a hardware component on the CPU, and is an integral part of the computer architecture. When Intel releases a new CPU, the details of the MMU are just as important as the data-crunching aspects. The component matches virtual memory to physical memory by use of look-up tables stored in RAM. The tables are managed by the operating system, but are used by the hardware automatically.

The MMU does not keep track of the association between virtual and physical memory on a byte-by-byte basis. It works in terms of units called pages and page frames.

Virtual memory is divided into pages and physical memory is divided into page frames. Pages and page frames are the same size, and the mapping of individual bytes within them is one-to-one: the zeroth byte of the page maps to the zeroth byte of the page frame, the first byte of one to the first byte of the other, etc. The MMU then uses tables to keep track of which page goes to which page frame. These tables are called the page tables.

A page can be various sizes; that depends on the technology of the CPU, and the size can also be chosen, within parameters of the CPU, by the operating system. Linux uses a very typical 4K-byte page size.

Each and every memory access goes through the MMU, where the address is broken into the page number and the offset in the page. The page number is looked up in the page table to get a page frame. The offset is applied to the page frame in order to find the actual physical byte to be retrieved. If the virtual address is 1492 and the page size is 100 bytes, the page number is 14 and the offset 92. If the 14th entry in the page table says 27, then the byte will be found in memory at location 2792: offset 92 into page frame 27.

However, machines work in base 2, so page sizes will be powers of 2, such as 4K. The page number is then found by looking at the higher-order bits; the offset is simply the lower-order bits.
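The arithmetic above can be sketched in a few lines of Python (an illustrative sketch; the function names are my own):

```python
def split_decimal(vaddr, page_size):
    """Split a virtual address into (page number, offset) by division."""
    return vaddr // page_size, vaddr % page_size

def split_binary(vaddr, page_bits=12):
    """For power-of-2 page sizes, shifting and masking does the same job."""
    return vaddr >> page_bits, vaddr & ((1 << page_bits) - 1)

# The example from the text: address 1492 with 100-byte pages.
page, offset = split_decimal(1492, 100)   # page 14, offset 92
frame = 27                                # page table says page 14 -> frame 27
paddr = frame * 100 + offset              # physical location 2792
```

With 4K pages, `split_binary` is what the hardware effectively does: no division, just wires selecting bit ranges.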

The page tables are themselves in virtual memory, and have entries for themselves. That way the operating system can work with the page tables. Even an assembly language program uses virtual addresses. The information in the page tables, however, is in terms of page frames, the physical addresses, and is used immediately by the MMU to access physical memory.

Each process has its own virtual memory, with its own mappings to physical space, so each process has to have its own page tables. When the operating system switches the running process, it must switch the page tables. The kernel must have its own page tables as well. Physical pages must be managed. Most are unassigned to any virtual page. Various events occur that make the operating system claim them, and associate them with a virtual page of some process. Other events cause page frames to be released. The data they contain can either be discarded or placed in "cold storage" for an eventual restoration of the virtual page. In that case, the page frame contents are written to the disk, into a swap file. The page tables are marked so that later the same location of the swap file can be copied back into a new page frame and reassociated with the virtual space, if needed.

Demand Paged Virtual Memory and Working Sets

Demand-paged virtual memory is the basic way in which pages are associated and released. The program continues until it references a virtual address which the page tables fail to map. There has as yet been no assignment of a page frame to that virtual page. This failure is called a page fault. On a fault, the program is demanding a page, and the operating system catches the fault, takes a free page frame, and associates it with the page in which the fault occurred. The program is then restarted to retry the access.
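A toy sketch of this demand-driven loop (the data structures and names here are invented for illustration, not any kernel's actual ones):

```python
page_table = {}          # virtual page -> page frame; absent means unmapped
free_frames = [3, 7, 9]  # a small pool of free page frames
faults = 0

def access(vpage):
    """Return the frame backing vpage, faulting it in on first touch."""
    global faults
    if vpage not in page_table:                # page fault: no frame yet
        faults += 1
        page_table[vpage] = free_frames.pop() # OS claims a free frame
    return page_table[vpage]                  # the retried access succeeds

access(0); access(1); access(0)  # two faults, then a hit on page 0
```

The essential shape is there: a miss in the mapping is not an error but a trigger for the operating system to populate the mapping and retry.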

Depending on the fault, the page frame is un-initialized (often just filled with zeros, to prevent inadvertent sharing of data between processes), refilled from the swap file (if this virtual page existed previously but was paged out), or filled from a disk file. This last is what occurs when the fault is on the instructions of the program, rather than data. The program is not loaded entirely; rather, it faults itself in.

As page frames sit without activity in main memory, they become candidates for release, to refuel the stock of free page frames. The hardware has some heuristic mechanisms for tracking the inactivity of pages, and identifying candidates for page-out. A page that is read-only can be released without further action. A page that has been written is called dirty, and the page must be written to the swap file before it can be reused.

That this system has any prayer of working depends on the fact that in a given time interval, a program runs only a small section of code and uses only a small collection of data. This is called locality of reference. The admittedly vague set of what the program is now using, both data and instructions, is called the working set. As long as the working set is resident, that is, all pages in the working set are associated with page frames, the program runs without faults. Too much faulting is bad. As the working set evolves there are faults. These faults bring in the new working set, and the pages no longer in the working set can be released.

The paging behavior of a program is very important for its efficiency. Although the code in the operating system does a good job of making the most of physical memory as a resource to populate the virtual memory spaces of the various programs, a poorly behaving program can be a problem. Good programmers are aware of the paging behavior of their programs.

Page Replacement

Eviction of a page from physical memory uses various heuristics to predict which is the least-cost route to follow. It is true that if one could see the future, then the page to evict is one that will never be reused, or, if all pages will be reused, the one whose first reuse is furthest in the future. This can be proven by a comparison argument that substitutes any other eviction for the eviction of the page with reuse furthest in the future, and argues that the latter can't be worse.

We can't see the future, so heuristics are used. One is the Least Recently Used (LRU) heuristic. Rather than looking forward to the furthest reuse in the future, it looks for the oldest last use in the past. The idea is that past behavior predicts future behavior. However, even this is too complicated, and further simplifications are employed.
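A minimal sketch of LRU over a reference string, counting faults (pure illustration; no kernel implements it this literally):

```python
from collections import OrderedDict

def lru_faults(refs, nframes):
    """Count page faults for reference string refs under LRU replacement."""
    frames = OrderedDict()   # pages in order of last use, oldest first
    faults = 0
    for page in refs:
        if page in frames:
            frames.move_to_end(page)          # refreshed: most recently used
        else:
            faults += 1
            if len(frames) == nframes:
                frames.popitem(last=False)    # evict the least recently used
            frames[page] = True
    return faults

lru_faults([1, 2, 3, 1, 4, 5], 3)  # 1,2,3 fault; 1 hits; 4 evicts 2; 5 evicts 3
```

Running the same reference string with more frames never increases the fault count for LRU; that stack property is one reason the heuristic is attractive in theory.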

FreeBSD uses the following algorithm for page replacement. Periodically the page frames are swept, clearing the access bit. If the access bit is found set, a counter associated with the page is incremented, else it is decremented. The counter is clipped to the range 0 through 64. Un-accessed pages are those with counter value 0.
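The sweep might be sketched as follows (the counter clipping is per the text; the data layout is invented for illustration):

```python
def sweep(pages):
    """One periodic pass: consume the access bit, nudge each page's counter."""
    for p in pages:
        if p['accessed']:
            p['counter'] = min(p['counter'] + 1, 64)  # clipped at 64
        else:
            p['counter'] = max(p['counter'] - 1, 0)   # clipped at 0
        p['accessed'] = False                         # clear for next interval

pages = [{'accessed': True, 'counter': 63},
         {'accessed': False, 'counter': 1}]
sweep(pages)
sweep(pages)
# second page has decayed to 0: un-accessed, a replacement candidate
```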

FreeBSD places pages in one of five queues: active, inactive, cache, wired and free. Free pages are either zero-filled or not; the swapout demon zero-fills free pages as it has time. Inactive pages are candidates for swapout; however, they may be dirty, in which case they are scheduled for writeback. Once an inactive page is clean it is placed on the cache queue. Such a page can be pulled back into use; it is a sort of second-chance queue. From the cache queue the page goes to the free queue. Once in the free queue it eventually ends up on the zeroed edge of the free queue.

FreeBSD uses a global page replacement mechanism. It does not track working sets per process. Windows does, because VMS did. (Whenever a Microsoft cognoscente begins to explain why WNT does something some way, just interrupt and say "it's because that's how VMS did it.") However, FreeBSD has not found an advantage to this, and rather contends that such a mechanism is more likely to mismanage than optimize resources.

Page Frame and Virtual Memory Data Structures, and Process Swapping

To keep track of free page frames, of which free frames are zeroed (you can't reuse a frame until it is zeroed, else you have a privacy problem), as well as possible reference counts for shared pages, a table is kept with one entry for each page frame.

There is yet another structure needed, which we will call the Virtual Address Descriptor, or VAD. This keeps track of regions of virtual memory, aligning them with the requested memory backing store. (A backing store is the physical memory behind the virtual memory.) It indicates whether the region is data, in which case the backing store is swap and initial faults get zero-filled pages, or text, in which case the backing store is a file; it then also indicates which file and the alignment between memory addresses and offsets in the file.

The VAD can declare the region "wired", which means it is never swapped. Often large portions of the kernel are wired. The VM code must be wired, else you would try to swap out the code that does the swapping. Also, memory that is involved in any DMAs to IO devices must remain in RAM during the DMA, so those pages will get wired.

There is another meaning to "swapping" a process. When there are too many processes, memory pressure builds. There is not enough physical memory to hold all working sets, and the progress of all processes suffers. In fact, processes can "thrash", where more time is spent moving pages to and from the swap space than computing.

Some systems, including FreeBSD, will completely evict a process, all its pages, and put the process in a "swapped-out" state, where it will receive no cycles. The swap demon will trigger on low memory to choose a process to swap out. The process will enter sleep and all of its pages will be swapped out.

Multilevel page tables

The map from virtual pages to physical page frames is in the form of a table which, for each page i, gives pf[i], the page frame. Not only does an architecture specify the number of bits in a virtual address, it also specifies the number of bits in a physical address. By coincidence, for 32-bit architectures, both are 32 bits long. This represents a space of 4G, either 4G of virtual space for each process, or 4G of physical space for the maximum memory possible on the motherboard (the exception being PAE, Physical Address Extension).

For 64-bit architectures, as embodied in the Intel/AMD chips, the virtual address is potentially 64 bits, but only 48 bits are supported by the MMU, and the number of bits in a physical address varies by chip, with 40 bits being the original AMD64 specification. Hence 256 terabytes of virtual space map onto 1 terabyte of physical RAM.

To save space, multilevel page tables are used. The virtual address is broken into a page number and an offset. The exact numbers differ between 32-bit and 64-bit architectures, but in either case it works out very cleverly. For 32 bits there is a page directory and a bunch of page tables. Each contains 1024 entries, and each entry is 4 bytes, so each is exactly one page in size. This is important because it means that the entries of the page directory and the page tables can be physical page frame values. The frame at which the page directory resides is loaded into the hardware. The top 10 bits of the virtual address select an entry from the directory. That entry is a page frame, and the next 10 bits of the virtual address select from that page frame an entry, which is the page frame of the page for the virtual address. The remaining 12 bits are the offset into that page frame for the exact byte.


 | 10 bits   |   10 bits    |    12 bits    |
      |             |              |
      |             |              |
      |    +----+   |              |
      |    |____|   |              |
      +--->|__*---/20-->+----+     |
           |____|   |   |____|     |
           |    |   |   |____|     |
             .      +-->|__*---/20---->+----+
             .          |____|     |   |____|
             .          |    |     |   |____|
           |____|          .       |   |____|
           |____|          .       +-->|____|
                           .           |    |
                        |____|            .
                        |____|            .
                                          .
                                       |____|
                                       |____|

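The 10/10/12 walk pictured above can be sketched as follows (the directory and tables here are toy Python dicts standing in for real page frames):

```python
PAGE_BITS = 12

def translate(vaddr, directory, tables):
    """Walk a two-level 10/10/12 page table for a 32-bit virtual address."""
    dir_index   = (vaddr >> 22) & 0x3FF   # top 10 bits index the directory
    table_index = (vaddr >> 12) & 0x3FF   # next 10 bits index a page table
    offset      = vaddr & 0xFFF           # low 12 bits: byte within the frame
    table_frame = directory[dir_index]    # directory entry names a table frame
    page_frame  = tables[table_frame][table_index]
    return (page_frame << PAGE_BITS) | offset

directory = {1: 5}               # directory entry 1 -> page table in frame 5
tables = {5: {2: 0x1A3}}         # that table maps page 2 -> frame 0x1A3
vaddr = (1 << 22) | (2 << 12) | 0x7C
translate(vaddr, directory, tables)   # frame 0x1A3, offset 0x7C
```

Note that both lookups return page frame numbers, which is exactly why it matters that directory and tables are each one page in size.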
For 64-bit architectures, 4K pages are still used. There are four levels of page tables. Each entry of a table must be larger: the architecture plans to achieve 52 bits of physical address space, or 4 petabytes, so there must be 8 bytes per table entry. For 4K pages, only 9 bits of index per table are then available; four tables give 36 bits, plus the 12-bit offset gives the 48-bit virtual space. The unused top 16 bits are sign-extended. That means that the value of bit 47 is copied up into the top 16 bits. Thus there is a "hole" in the virtual memory space, beginning at 0x0000 8000 0000 0000 and extending to 0xffff 7fff ffff ffff.
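The sign-extension rule can be checked directly; addresses outside the hole are conventionally called canonical (a sketch, with the predicate name my own):

```python
def is_canonical(vaddr):
    """A 64-bit address is usable iff bits 48-63 are copies of bit 47."""
    top = vaddr >> 47                  # bit 47 and everything above it
    return top == 0 or top == 0x1FFFF  # 17 zeros or 17 ones

is_canonical(0x00007FFFFFFFFFFF)   # last address below the hole
is_canonical(0x0000800000000000)   # first address inside the hole
is_canonical(0xFFFF800000000000)   # first address above the hole
```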

The PTE entries, etc., allow 40 bits for the page frame number. Therefore the architecture plans for, and ultimately can have, a 52-bit physical address space, considering a 4K page size (12 bits). Currently (October 2010), cpuinfo reports that my Intel(R) Xeon(R) CPU E5520 supports only 40 bits of physical space. Hence of the 40 bits for the frame number, only 28 are being used.

See the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3a: System Programming Guide, Part 1 for more information on the various addressing modes, for instance Figure 4-8 for IA-32e paging mode, 4K pages.


| 16 bits |  9 bits  |   9 bits   |   9 bits    |   9 bits    |   12 bits    |
              |             |             |            |             |
              |             |             |            |             |
              |    +----+   |             |            |             |
              |    |____|   |             |            |             |
              +--->|__*---/40-->+----+    |            |             |
                   |____|   |   |____|    |            |             |
                   |    |   |   |____|    |            |             |
                     .      +-->|__*---/40--->+----+   |             |
                     .          |____|    |   |____|   |             |
                     .          |    |    |   |____|   |             |
                   |____|          .      |   |__*---/40-->+----+    |
                   |____|          .      +-->|____|   |   |____|    |
                                   .          |    |   |   |__*---/40--->+----+
                                |____|           .     |   |____|    |   |____|
                                |____|           .     +-->|____|    |   |____|
                                                 .         |    |    |   |____|
                                              |____|          .      +-->|____|
                                              |____|          .          |    |
                                                              .             .
                                                           |____|           .
                                                           |____|           .
                                                                         |____|
                                                                         |____|

PAE: Physical Address Extension

PAE is an alternative paging scheme for 32-bit processors which extends the physical address space to 36 bits. The PTEs are redefined to hold a 24-bit page frame number, requiring 64-bit PTEs. Holding the page size at 4K, the number of PTEs in a page therefore drops to 512, or 9 bits of index, from 10 bits. That is, each PTE takes up 8 bytes, bytes 0 through 7, rather than 4, bytes 0 through 3. An additional level of page tables must be introduced. In PAE the 32-bit virtual address is broken up as 2, 9, 9 and 12. First one of four different page directories is selected, then one of 512 subsequent page tables is selected, and then one of 512 subsequent page frames is selected. Finally the 12 bits of offset select the byte within the frame.
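The 2/9/9/12 split can be sketched the same way as the earlier translations (illustrative only):

```python
def pae_split(vaddr):
    """Break a 32-bit virtual address into PAE indices: 2, 9, 9 and 12 bits."""
    return ((vaddr >> 30) & 0x3,     # which of 4 page directories
            (vaddr >> 21) & 0x1FF,   # which of 512 directory entries
            (vaddr >> 12) & 0x1FF,   # which of 512 page table entries
            vaddr & 0xFFF)           # offset within the 4K frame

pae_split(0xFFFFFFFF)   # the highest address: last entry at every level
```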

Translation lookaside buffers, large pages

And there's more. The concept of virtual memory is multifold; there are many things enabled by virtual memory: efficient use of fast memory with slower disk-based backing store for memory not currently active, efficient use of memory for sparse memory layouts, isolation of memory between processes, purity of each process's view of memory as being all its own, various tricks for sharing memory between processes by mapping the same physical page into multiple virtual spaces, and so forth.

To speed the translation, especially now that it is understood that the resolution of a virtual address requires several table lookups, currently resolved mappings are cached in the processor's translation lookaside buffer (TLB). On each virtual address translation, the lookaside buffer is queried in parallel with the start of a full page table lookup. If the mapping is currently in the lookaside buffer, the buffer returns quickly and the full lookup is cancelled.
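The TLB's role can be sketched with a dict as the buffer; the miss path calls a full walk, here a stand-in function with an invented mapping:

```python
tlb = {}           # virtual page -> page frame, recently resolved mappings
hits = misses = 0

def walk_page_tables(vpage):
    """Stand-in for the full multi-level lookup (mapping invented)."""
    return vpage + 100

def translate(vaddr, page_bits=12):
    """Consult the TLB first; fall back to a full walk on a miss."""
    global hits, misses
    vpage, offset = vaddr >> page_bits, vaddr & ((1 << page_bits) - 1)
    if vpage in tlb:
        hits += 1
    else:
        misses += 1
        tlb[vpage] = walk_page_tables(vpage)  # cache the resolved mapping
    return (tlb[vpage] << page_bits) | offset

translate(0x1234); translate(0x1FFF); translate(0x2000)  # miss, hit, miss
```

A real TLB is a small associative hardware cache queried in parallel with the walk, but the hit/miss accounting is the same idea.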

Since each process has its own mappings, when a process is changed, the maps are changed. This is done quickly by changing only the value of the pointer to the page directory. However, now the lookaside buffers contain possibly wrong translations. On process switch, the lookaside buffers must be reset. The time to rebuild the lookaside buffer is a major cost of the context switch.

The Intel architecture allows for either 4K pages or a large page of 4M. The large page requires only the use of the page directory. The top 10 bits give the frame number for the large page, and the remaining 22 bits the offset into the frame. The operating system uses large pages to map itself. (This is true in both Linux and Windows.) What is important about this is that fewer TLB entries are required to map the kernel. The Intel and AMD chips have TLB caches of 256 to 512 entries.

Swap space

Pages are copied in according to demand paging; that is, on a page fault, the desired data is found, a page frame is allocated, the data is copied into the frame, and the page tables are updated so that this frame backs the page that faulted.

Pages are retired by a method that resembles the Least Recently Used algorithm. However, where does the page data go? What if this is a dirty page, i.e., a page that has been written to? The data must be copied somewhere in case the data on this page becomes needed again.

When not in main memory, pages are stored in swap space, an area on a hard drive set up in page-sized slots. A page-out copies the information of the page to a swap slot. The page tables are updated so that the page points no longer to any page frame but to this swap slot. This requires at least four marks (of some sort) in the PTE (Page Table Entry). The PTE must be marked as invalid, so that accesses to this memory will fault. Second, the PTE must mark that the data for the page is in swap, rather than this being a never-populated section of virtual memory. Third and fourth, the entry must name the swap device and the slot number within the swap device wherein lies the page data.
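One way to picture the four marks packed into a single PTE word (the field widths and bit positions here are invented for illustration; real formats differ by architecture):

```python
PRESENT = 1 << 0    # set when the page is backed by a page frame
IN_SWAP = 1 << 1    # set when the data lives in a swap slot instead

def pte_for_swap(device, slot):
    """Encode 'invalid, in swap, on this device, at this slot' in one word."""
    return IN_SWAP | (device << 2) | (slot << 6)   # PRESENT bit left clear

def decode_swap_pte(pte):
    """Recover (device, slot) from a swapped-out PTE."""
    assert not (pte & PRESENT) and (pte & IN_SWAP)
    return (pte >> 2) & 0xF, pte >> 6

decode_swap_pte(pte_for_swap(2, 1234))   # recovers device 2, slot 1234
```

Because the PRESENT bit is clear, any access through this PTE faults, and the fault handler finds the device and slot right there in the entry.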

Very conveniently, when a fault occurs on such a PTE, the information is handy to retrieve the page data from swap and copy it into a freshly assigned page frame. The PTE is updated and the faulting instruction restarted.

Traditionally, swap space was a dedicated partition of the disk. It can also be a file in the filesystem, though it might be faster as a dedicated partition. Conventional wisdom is that swap space should be sized to one or two times the size of physical memory. Here is a look at the swap space on one of my Debian machines:

debian-32:~# fdisk /dev/sda

The number of cylinders for this disk is set to 4438.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/sda: 36.5 GB, 36507222016 bytes
255 heads, 63 sectors/track, 4438 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x0007a0a4

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        4250    34138093+  83  Linux
/dev/sda2            4251        4438     1510110    5  Extended
/dev/sda5            4251        4438     1510078+  82  Linux swap / Solaris

Command (m for help): 

There is a related notion of swapping out a task. This occurs when the total memory required by the working sets exceeds physical memory. Then the computer thrashes: it does no useful work, it just pages in and out, and each page thrown out is immediately needed back in. There is just not enough physical memory. In these situations the operating system will sleep a process and page out all its pages, reducing the demand for physical memory. The process is said to be swapped out. Hopefully, thrashing will cease, the resident processes will finish efficiently, and the swapped-out process will be swapped back in.

Virtual memory and shared memory

The same physical page frame can be mapped into multiple virtual memory spaces. One use for this is to have processes communicate through shared memory. There are syscalls, such as shmat (shared memory attach), which provide this functionality to the user. The memory need not map to the same address. Although data can be passed this way, pointers to data structures, in general, cannot. The pointers will be virtual addresses, and they need not point to the same thing in the different processes.

Shared memory can quickly fork child processes by marking writable pages as copy-on-write. The fork will not create a clone, but simply set up new page tables, copies of the original page tables. When a write causes the two virtual spaces to diverge, a new page frame is allocated for the child process, and the page table updated to substitute the new frame for the old frame. This way the child process starts quickly, and only what is needed is copied, so that both processes see their own image of the memory.
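Copy-on-write can be sketched as an operation on page tables (toy structures; a real kernel does this with protection bits and fault handling):

```python
frames = {0: b'parent data'}   # page frame contents, by frame number
next_frame = 1

parent = {0: 0}                # parent's page table: page 0 -> frame 0
child = dict(parent)           # "fork": copy the table, share the frame

def cow_write(table, vpage, data):
    """First write through a shared mapping copies the frame, then writes."""
    global next_frame
    frame = table[vpage]
    if sum(t[vpage] == frame for t in (parent, child)) > 1:  # still shared?
        frames[next_frame] = frames[frame]   # copy the old frame's contents
        table[vpage] = next_frame            # remap this table only
        frame = next_frame
        next_frame += 1
    frames[frame] = data

cow_write(child, 0, b'child data')  # child diverges; parent's page intact
```

Until the write, both tables point at frame 0 and no copying has happened; after it, each process sees its own image of the page.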

The paging out of a shared page is complicated by the fact that it cannot go to swap space if only some of the processes have evicted the page from their working set. An additional structure is needed to accomplish the proper move to swap space for shared pages, which in Linux is called the swap cache.

The swap cache contains pairs (S,P) where S is a swap slot and P is a page frame. For pages that are shared, and are evicted in some working sets but not others, there is a (S,P) entry in the cache, where P is the current page frame, and S is the swap slot reserved for the page, when it does leave physical memory. When the first process evicts the page, the entry (S,P) is created, and the slot S is written in the evicting process's page table.

The VM system always checks the cache for an (S,P) hit, either when swapping out or swapping in. Further evictions will find the (S,P) entry in the cache and consistently write the same slot S for the page; however, only when all processes have evicted does the actual write to swap occur, and the (S,P) entry can then be removed.

On first swap in, a new page P' will be chosen and the entry (S,P') will be written to the cache, so that the remaining processes, if they swap the page in, will find the (S,P') entry in the cache and will simply update the page table.

In addition to the cache data structure, reference counts must be kept of the number of times a page frame appears in any page table, and the number of times a swap slot appears in any page table. This informs the VM system when to do the actual data move. When the page frame reference count reaches zero, then there are no more references to P, so the page is moved to slot S and the (S,P) entry is dropped. It is now useless. Likewise when the reference count on the swap slot drops to zero, the entry (S,P') is useless, and can be dropped.
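The eviction side of this bookkeeping might be sketched as follows (pure illustration; Linux's actual swap cache is considerably more involved):

```python
swap_cache = {}        # slot S -> frame P, for pages only partly evicted
frame_refs = {7: 2}    # frame 7 is mapped by two processes' page tables
slot_refs = {}         # how many page tables now name each swap slot

def evict(frame, slot):
    """One process evicts the page: record (S, P), defer the real write."""
    swap_cache[slot] = frame
    frame_refs[frame] -= 1
    slot_refs[slot] = slot_refs.get(slot, 0) + 1
    if frame_refs[frame] == 0:   # last mapping gone: write to swap now
        del swap_cache[slot]     # the actual data move would happen here
        return 'written to swap'
    return 'deferred'

first = evict(7, 3)   # entry (3, 7) sits in the cache, write deferred
last = evict(7, 3)    # frame refcount hits zero: real write, entry dropped
```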