
Virtual Memory Replacement Using Historical
Information on Virtual Objects

Robert B. Hagmann
Xerox Palo Alto Research Center


CSL-91-7   September 1991   [P91-00094]


© Copyright 1991 Xerox Corporation. All rights reserved.


Abstract: Most current virtual memory systems retain very little information about
page usage. Typically, only a single bit of history is used. The history is usually only
kept on pages currently resident in physical memory. With larger main memories,
changes in applications, and the increases in file system caching in memory, these
design decisions should be reviewed. If multiple bits of history are kept on virtual
objects, whether they are memory resident or not, then there is the potential to
improve memory replacement performance.

This paper describes a measurement technique that is used to monitor memory usage.
It then proposes a new virtual memory replacement algorithm that is partially based
on periodic, sequential, and transient behaviors. It also describes an approximation to
LRU, called cluster LRU, that performs better on the programs measured than the usual
clock algorithm. The performance of various algorithms is compared by trace driven
simulation. Finally, this paper shows how a Translation Lookaside Buffer can be
designed for a RISC machine to support both cluster LRU and detection of periodic,
sequential, and transient behaviors.

CR Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles -
Virtual memory; D.4.2 [Operating Systems]: Storage Management - Virtual memory

Additional Keywords and Phrases: Large Main Memories




XEROX   Xerox Corporation
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304




1. Introduction

Much has changed since most current virtual memory replacement algorithms
were designed over a decade ago. The amount of memory and the usage of memory
were different then. The characteristics of program behavior when using hundreds of
pages are different from those when using tens of thousands of pages of memory. Many
machines are now single-user workstations rather than time-shared machines. On a
single-user machine, the time for a page fault is now directly visible to the interactive
user, particularly when pages are faulted over a network. In addition, some systems
now make no major distinction between pages for file system buffers and pages for
program execution (see [Nels88] for Sprite and [Gingell] for SunOS 4.0). Mapped files
further blur the distinction between file system buffers and pages for program
execution. Excessive demand by one part of the system may have a bad performance
impact on the other.
Systems with vast amounts of memory are being constructed. Algorithms that
work well with hundreds of pages may not scale very well to a million pages.
Below are examples of real situations that may cause problems. The author
regularly experiences seven of these on his 48 MByte workstation.
transient access to a large amount of file data (e.g., grep through several
megabytes of source)
applications that are used every few minutes, but are inactive between uses
(e.g., spell checkers, compilers, loaders, debuggers, typesetters, calendars)
garbage collection
interactive and background use of a workstation, where the background
program will consume lots of memory so that the interactive pages tend
to get paged out if not frequently referenced; sometimes called the "cold
window" problem
multi-media data that is transient (e.g., a single screen image is 4 MB [24 bits
per pixel, 1024 x 1280])
periodic use of large applications, where the total size is larger than memory
algorithms that cycle through phases, as in some scientific programs
sequential access to very large files
copy of lots of file data using NFS from a client workstation
All of these examples show memory demands that are hard for most current
virtual memory replacement algorithms.


XEROX PARC, CSL-91-7, September 1991




Many users never do anything similar to these examples and do not experience any
problems. If all you do is read your (text) mail, edit with vi, and run the C compiler,
then there won't be a problem. Others buy so much memory that the problem is
manageable. Memory is still the dominant cost in many systems. Poor memory
management can cause users to buy more memory than they really need.

Operating systems papers are about abstractions and implementations. This paper
is primarily about the abstraction of virtual memory replacement. It proposes a small
paradigm shift: virtual memory replacement should be based on virtual object
histories. Most popular systems use a single bit of reference data, and the bit is only
kept for pages that are in physical memory. With virtual object histories, periodic,
sequential, and transient reference behavior (discussed more below) can be detected
or estimated. Based on histories and algorithms, substantial improvement in the
page-in behavior of a system can be made. This paper presents algorithms that have
significant performance improvements for a small set of traces. One algorithm is an
approximation of LRU that does substantially better than "clock" for the traces. A
second algorithm divides pages into two pools: those pages that have a periodic,
sequential, or transient reference behavior and those pages that have an LRU
reference behavior, and then selects pages for replacement from these two pools.
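The two-pool idea can be sketched roughly as follows (a minimal Python illustration; the pool-membership predicate and the MRU/LRU victim choices are assumptions based on the description above, not the paper's actual algorithm):

```python
def pick_victim(last_ref, is_pattern_page):
    """Choose a replacement victim from two pools.

    last_ref: dict mapping page -> time of its last reference.
    is_pattern_page: predicate marking periodic/sequential/transient pages.
    Pattern pages are replaced MRU-like; ordinary pages by plain LRU.
    """
    pattern = {p: t for p, t in last_ref.items() if is_pattern_page(p)}
    if pattern:
        # MRU-like: the most recently touched pattern page (e.g., the
        # latest page of a sequential scan) is unlikely to be needed soon.
        return max(pattern, key=pattern.get)
    # Otherwise fall back to least-recently-used among ordinary pages.
    return min(last_ref, key=last_ref.get)
```

For example, with pages {a: 1, b: 5, c: 3} where b and c are flagged as pattern pages, b (the most recent pattern page) is evicted first; with no pattern pages, a (the least recent) is evicted.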

The rest of this paper is organized as follows. Section 2 has a discussion of the
issues and history of memory replacement; Section 3 investigates the reference
patterns that are typical for problem applications; Section 4 describes the tracing
technique, the simulator that was built, and the algorithms that were simulated;
Section 5 reports on the measurement and simulation; Section 6 gives a straw man
design of hardware that can implement one of the proposed algorithms; Section 7
presents conclusions and proposals for future work.


2. Discussion

Currently, it is typical that virtual memory systems are built for short-term
demands. Until fairly recently, paging was a major bottleneck in system performance.
Although paging is now not normally a bottleneck, systems are still plagued with
occasional, substantial delays due to poor paging decisions. Part of this can be easily
traced to the system making reasonable short-term decisions that have a bad
medium-term effect. Some of the cause of this is the algorithms in use today, as they
are just tuned versions of older algorithms. The older algorithms were concerned with
short-term memory demand because memory was too expensive to have enough to
allow medium-term effects to be important. Even though they have been re-tuned
(hopefully), the basic assumptions of the algorithms partially determine the results.

This is not a new observation. The book on the 4.3BSD UNIX Operating System
[Leff89] says: "4.3BSD is not perfect. In particular, the virtual-memory system needs
to be completely replaced. The new virtual-memory system needs to provide
algorithms that are better suited to the large memories and slow disks currently
available, and needs to be less VAX architecture dependent."

For example, consider the global clock algorithm that was used in 4.3BSD
([Baba81] and see discussion in [Leff89] Section 5.12). This algorithm is used in similar
systems such as SunOS and USL System V Release 4. Global clock is a commonly used
approximation of LRU. A variant of clock is also used in IBM's VM system [Tetz87].
Sprite uses clock for VM pages, but exact LRU for file system pages [Nels88]. Mach also
has an LRU approximation. It uses two queues of pages: active and inactive lists. FIFO
replacement is done on the inactive queue. Pages are moved from the active queue to
the inactive queue when the inactive queue is too small. Faults on inactive pages move
them back to the active queue [Youn89]. This is similar to clock in that it makes pages
vulnerable to page-out in a window, with a reference in the window preventing
page-out.
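The active/inactive scheme just described can be modeled in a few lines (a toy Python sketch; the class name and the balancing threshold are illustrative, not Mach's actual parameters or code):

```python
from collections import deque

class TwoQueueVM:
    """Toy model of a Mach-style active/inactive page scheme."""

    def __init__(self, target_inactive):
        self.active = deque()          # recently used pages
        self.inactive = deque()        # FIFO reclaim candidates
        self.target_inactive = target_inactive

    def touch(self, page):
        # A fault on an inactive page moves it back to the active queue.
        if page in self.inactive:
            self.inactive.remove(page)
        if page not in self.active:
            self.active.append(page)
        self.balance()

    def balance(self):
        # Move oldest active pages over when the inactive list is too small.
        while len(self.inactive) < self.target_inactive and self.active:
            self.inactive.append(self.active.popleft())

    def reclaim(self):
        # FIFO replacement is done on the inactive queue.
        return self.inactive.popleft() if self.inactive else None
```

A page is only reclaimed if it sits unreferenced through the whole inactive queue, which is what makes the scheme behave like a clock window.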

Basically, as implemented in SunOS, the clock algorithm has two hands that move
together but are separated by some fixed number of pages. The anticipated need for
memory causes the hands to turn. The hands point to physical memory pages that are
conceptually arranged as a clock. Each physical page has a reference bit and a
modified bit. The front hand turns off the reference bit. A reference to a page causes
the hardware to turn on the reference bit. The back hand samples the reference bit. If
the reference bit has not been set, then the page is added to the "cached free" list. If
the modified bit is set, the page must first be cleaned by writing it to disk. Once the
page is added to the cached free list, its identity and contents are not lost. If the page
is needed (e.g., faulted) before it is used as a free page, it is removed from the cached
free list and used.
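The two-hand mechanism can be sketched as follows (a simplified Python model under stated assumptions: frames arranged in a circle, both hands advancing one frame per tick, and the dirty-page cleaning step omitted; this is not the SunOS implementation):

```python
class TwoHandClock:
    """Toy two-hand clock over physical page frames arranged in a circle."""

    def __init__(self, nframes, hand_gap):
        self.referenced = [False] * nframes   # hardware-set reference bits
        self.nframes = nframes
        self.front = 0                        # front hand clears bits
        self.hand_gap = hand_gap              # fixed spacing between hands

    def reference(self, frame):
        self.referenced[frame] = True         # done by hardware on access

    def tick(self):
        """Advance both hands one frame; return a frame to free, or None."""
        self.referenced[self.front] = False   # front hand clears the bit
        back = (self.front - self.hand_gap) % self.nframes
        victim = None
        if not self.referenced[back]:         # back hand samples the bit
            victim = back                     # would go on the cached free list
        self.front = (self.front + 1) % self.nframes
        return victim
```

A page survives only if it is referenced in the window between the front hand clearing its bit and the back hand sampling it, which is exactly the narrow window the text criticizes.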

Note that this is an algorithm that is physical memory based: the only reference
data kept is for pages in physical memory. In addition, the amount of data is one bit or
less. When the page is not between the two hands, the value of the reference bit is
meaningless.

When everything fits, the clock does not turn. When the clock is not turning, no
sampling of memory usage is being done at all (almost). If there is sudden memory
demand, the system reacts by turning the clock. Pages with an extremely long time
since their last reference are not replaced if they are not near the clock hands. At high
clock turning rates, it takes only two seconds between the front and back hand examining a
page (using the SunOS 4.0 tuning parameters). If in this narrow window the page is
not touched, it is taken and made into a free page. At worst, this degenerates into
random replacement. In practice random replacement would not happen for all
pages, but it could happen for a good number of pages. In any event, the reference
history during the time when everything fit is lost.

Systems are tuned to make the clock turn slowly. For the DEC VAX (or at least for
the VAX-11/780) this was important. There are no reference bits on the VAX. vmunix
(and the BSD UNIXes) simulates them by invalidating the page, taking a trap if the
page is referenced, and recording the reference [Baba81]. Using hardware with
reference bits, the cost of sampling pages is greatly reduced. On a SparcStation-1,
turning the clock a few times faster than normal using SunOS 4.0 caused a barely
noticeable change in system CPU time (as measured by vmstat). The default maximum
clock rate was tuned in SunOS 4.1 to be a factor of five faster than in SunOS 4.0.
Turning the clock "too fast" makes the system do page cleans (writes of dirty pages to
the swap file), so it is not easy to get a good cost figure. The point is that turning the
clock quickly does not and should not cost too much CPU time.

With larger memories and single-user workstations, faults are by comparison much
worse now than 20 years ago. For a PDP-10 (late 1960s), it took about 30,000
instruction times to answer a fault. For a SparcStation-2, a fault takes twenty times as
long, when measured in instruction times. Systems are getting more and more I/O
bound [Oust89]. On a time-shared PDP-10, the time of a page fault was not lost:
other users would get the CPU. On a workstation, there are often multiple activities
running concurrently, but the interactive use of the machine determines how happy
the user is with it. A fault in the interactive software stops the machine as far as the
user is concerned. As our machines do not fault very much, a storm of faults is very
noticeable and very obnoxious. We tend to remember the storms and forget the
normally good performance. The variance is high and people do not like high
variance.

Systems are merging uses of physical memory [Abro89]. Before SunOS 4.0, the
SunOS virtual memory system and the file systems had separate, fixed-size pools of
physical memory to administer. For virtual memory bound or for file system bound
tasks, this was unfortunate since physical memory was not being used where it was
needed. SunOS 4.0 changes this by making one pool of physical memory for which
both the virtual memory system and the file system(s) compete [Moran]. (Some data
structures are allocated from kernel memory, but all buffer pool pages and virtual
memory pages are allocated from the same pool.) However, the clock algorithm is still
used to determine what page to evict when memory is needed. This means that
transient file system behavior that used to destroy the buffer pool but leave virtual
memory intact now destroys buffering and caching of all physical memory. Typical
transient behavior of this sort is "grepping" (searching for a pattern through) a file that is
bigger than memory. SunOS 4.1 added some performance tuning so that some types of
clustering and sequential behavior can be given to the system as hints (madvise system
call) [McVo91].
Other systems have used large physical memories to improve file system
performance. In Sprite, very large client memories are used. There is a soft boundary
between the file system and virtual memory pages. File system pages are artificially
aged to improve VM performance.


3. Reference patterns

Individual, isolated page faults are usually not a problem. Except for single-
threaded, real-time applications, they influence the latency or throughput of a
system in a minor way. But page fault storms do affect the system. Why do page fault
storms occur? By looking at the reference patterns, three types of non-LRU reference
behaviors emerged. The patterns are:
Periodic: An object or a page is used infrequently, but is re-referenced. The
period may or may not be fixed. Examples: calendars (the author just had a 15 second
page fault storm opening his calendar), garbage collection, and receiving mail.
Sequential: When an object is used, it is read from the beginning to the end. This
is typical for files. Examples: grep of a large file, loading an image file to a
workstation, and playing a video segment.
Transient: An object is referenced for a brief time, then not again for a very long
time. Examples: grep of many files in a directory and listening to voice mail.
Periodic behavior, where the period is longer than the sampling time of the virtual
memory replacement algorithm, is not handled well by algorithms that approximate
LRU. References are so far apart that they are not noticed by the algorithms. For
sequential references, the page referenced previously is the least likely to be
referenced next. Transient references also are not good predictors of future
references.
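One way to see how multi-bit virtual-object histories make such detection possible: keep a short list of recent reference times per object and classify the gaps between them (an illustrative Python sketch; the thresholds and classification rules are assumptions, not the paper's algorithm, and detecting sequential behavior would additionally compare successive page numbers rather than times):

```python
def classify(ref_times, now, long_gap=60.0, jitter=0.25):
    """Guess a reference pattern from an object's recorded reference times."""
    gaps = [b - a for a, b in zip(ref_times, ref_times[1:])]
    if not gaps:
        # Only one reference recorded: transient if it was long ago.
        return "transient" if now - ref_times[-1] > long_gap else "unknown"
    mean = sum(gaps) / len(gaps)
    spread = max(gaps) - min(gaps)
    if now - ref_times[-1] > long_gap and mean < long_gap:
        return "transient"   # a burst of use, then silence
    if mean > long_gap and spread <= jitter * mean:
        return "periodic"    # widely spaced but regular re-references
    return "recency"         # treat as an ordinary LRU-managed object
```

With one bit of history none of these distinctions can be drawn; with even a handful of timestamps per virtual object, a burst followed by silence looks quite different from evenly spaced re-references.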

For all of these cases, the LRU assumption is violated: all of these types of
references are not LRU. Pages that have references that are periodic, sequential, or
transient appear to be better served by algorithms that replace pages in an MRU-like
fashion, while other pages are replaced LRU. The challenge is to discover these types
of reference patterns and integrate this information into the resource management
for memory.

Of these three patterns, the one that is most commonly recognized in current
systems is sequential. With sequential access, pre-paging or file system read-ahead
can be used to stream data in more efficiently by overlapping processing with reading,
and by accessing secondary or network file storage in larger pieces. Also,
post-flushing of pages once they are used can provide high bandwidth access to data in
a fixed buffer size (as opposed to polluting memory with transient data).
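The combination of read-ahead and post-flushing amounts to sliding a fixed-size window over a sequentially read file (a minimal Python sketch; the window size, the set-based residency model, and the flush-immediately policy are illustrative assumptions):

```python
def stream_pages(total_pages, window=4):
    """Sequentially process pages with read-ahead and post-flush,
    keeping at most `window` pages resident at once."""
    resident = set()
    max_resident = 0
    for page in range(total_pages):
        # Read-ahead: fetch the next `window` pages in one larger piece.
        for p in range(page, min(page + window, total_pages)):
            resident.add(p)
        max_resident = max(max_resident, len(resident))
        # ... process `page` here ...
        resident.discard(page)   # post-flush: evict the page just used
    return max_resident
```

However large the file, the resident footprint never exceeds the window, so the scan cannot pollute the rest of memory with transient data.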

If LRU or Global Clock replacement is done for transient pages, then the transient
pages will evict other pages and will be retained in memory for a long time (as long as
the clock or LRU replacement takes (e.g., minutes