Pune
Institute of Computer Technology
LFS
(Log-structured File
System)
By ...
Abhishikt N. Jain.
E-mail : author@abhishikt.8k.com
Web Site : http://abhishikt.8k.com
(1999 - 2000)
Acknowledgement:
I take this Opportunity to express my heartful thanks for all the people who have helped me in completion of this seminar. I am thankful to Prof. Dr. S. A. Kulkarni, H. O. D., Computer Dept., for giving me this opportunity to present my seminar. I also thank to my friend, Ashutosh S. Rajekar, who suggested me this topic. I also thank to Prashant V. Bhuse for his help. The thanks are also due to Mr. Cristian Czezatke, who provided some detailed information about Log-structured File Systems.
Last, but not the least, I would like to
thank all my friends and colleagues, who directly or indirectly
helped me by providing suggestions and valuable inputs from time
to time.
Abhishikt N. Jain.
E-mail: author@abhishikt.8k.com
Web Site : http://abhishikt.8k.com
(1999-2000)
INDEX
| S. No. | Title | Page No. |
| Acknowledgement | 2 | |
| 1 | Introduction 1.1 Unix File Systems (UFS) 1.2 Unix Fast File System (FFS) 1.3 Log-structured File System (LFS) |
5 |
| 2 | Some Techniques for High
Performance 2.1 Read and Write Caching 2.2 Disk Optimization Techniques |
7 |
| 3 | Log-structured File
System 3.1 Overview of Log-structured File System 3.3 Data Location and reading 3.4 Inode map |
9 |
| 4 | Free Space Management in
LFS 4.1 Using Threading 4.2 Using Copying 4.3 Combination of Treading and Copying 4.4 Segments (Mechanism to Manage Free Space) 4.5 Segment Cleaning 4.6 Segment Selection 4.7 Segment Layout 4.8 Log regions and segment summary blocks 4.9 Log region layout |
14 |
| 5 | Enhancing Performance of
LFS 5.1 Enhancing Performance 5.2 Effect of Segment size 5.3 Cleaning Policies 5.4 Cleaning Algorithms |
26 |
| 6 | Summary and Conclusion | 30 |
| Bibliography | 31 |
Chapter 1
Introduction
Computer program or system consists of several components, and if some of the components are made much faster while leaving the other components unchanged, then the unimproved components are likely to become a performance bottleneck.
Recent technology trends suggest that disk input and output may become such a bottleneck in the near future. CPU speeds are increasing dramatically along with memory sizes and speeds, but the speeds of disk drives are barely improving at all. Without new file system techniques, it seems likely that the performance of computers in the 2000s will be limited by the disks they use for file storage. For example, suppose that an application program today spends 10% of its time waiting for disk I/O. If CPU speed improves by a factor of 10 without any disk improvements, then the application will only run about 5 times faster and will spend about 50% of its time waiting for I/O. If CPU speed improves by another factor of 10, still without disk improvements, then the application will only run about twice again as fast, and will spend about 90% of its time waiting for I/O. The hundred-fold overall improvement in CPU speed will only result in a ten-fold improvement in application speed.
Fortunately, there are two technology trends that offer hope for beating the I/O bottleneck: large memories and disk arrays. First, as memories get larger and larger, it becomes possible to cache more and more file information in main memory and thereby eliminate many disk accesses. Second, disks are becoming much cheaper and physically smaller. This makes it practical to build disk systems containing tens or hundreds of drives (disk arrays). Disk arrays dont improve the performance of a single small access, but they offer much greater overall bandwidth and the possibility of many concurrent accesses.
This seminar is a discussion of
the technology trends related to I/O, the problems they may pose
for future engineering and office applications, and a set of
possible solutions. Our overall goal is to find ways of
capitalizing on fast-moving technologies (CPU speed, memory size,
disk array size) to compensate for slow-moving technologies (disk
seek times) so that future systems will not be I/O-limited [3].
1.1 Unix File System (UFS):
The basic storage in Unix is a file. A Unix file is represented as a sequence of bytes. Files also have attributes about the owner, access permissions, etc. Each file is uniquely identified as i-number. Unix uses a directory as a special type of file. The entries in directory may refer to other directories or files, this leads to a tree like structure. Unix system has fixed blocks of size 512-bytes.
For each open file a current offset into the file is maintained by the file system. The write system calls allows an application to transfer an arbitrary amount of data to a file starting at the current offset. The read system call allows an arbitrary amount of data to be read into memory buffer from the file starting at the current offset. The current offset is updated by both the read and write system calls. The application can change the offset without transferring data using the lseek system call. By explicitly setting the current offset before each transfer an application can perform non-sequential accesses to the file.
1.2 Unix Fast File System (FFS):
In the early 1980s a major revision of the Unix file system (UFS) was done at Berkeley [2]. The resulting file system, called the Berkeley Fast File System (Unix FFS), has become one of the most widely used and studied Unix file systems. Unix FFS changed the on-disk layout in a number of ways that improved performance. Changes included increasing the block size, changing the disk region layouts, and a new block allocation algorithm. This section presents the changes made in Unix FFS while the next section will review the storage management techniques that motivated the changes.
Unix FFS removes the limit that blocks were only 512 bytes in size. Blocks in the Unix FFS file sys-tem can be any power of two starting with a minimum size of 4096 bytes up to the limits imposed by the disk controller and drives. Unix FFS also adds the notion of a block fragment that only occupies a portion of a file system block. To reduce internal fragmentation the last block in a file is allowed to be one or more fragments rather than an entire file system block. The fragment size is set to be one-fourth or one-eighth the size of a block. Although having fragments complicates the design, fragments are necessary to avoid large amounts of disk space being lost to internal fragmentation. Studies showed that with a 4096 byte block as much as 50% disk space would be lost on normal workloads.
Unix FFS keeps the basic on-disk format of inodes (explained latter in section 3.4) and indirect blocks that was present in the initial Unix implementation. However, the inodes are no longer clustered together in their own region. Instead, Unix FFS breaks the disk into regions of adjacent cylinders called cylinder groups. Each cylinder group has its own inode region and data block region.
Unix FFS also has a new block
allocation algorithm and data structures. The freelist data
structure is replaced by a bitmap data structure that
records the allocated state for each block.
1.3 Log-structured file systems:
This dissertation describes a new disk storage management technique, called log-structured file systems, that is well suited for the technology and challenges facing the disk storage managers of the 1990s. Using a log-structured file system, storage write requests can be processed an order of magnitude more efficiently that current file systems; this allows the power of future processors to be fully utilized. The improvement in write performance is accomplished by writing all new information to disk in a sequential structure called the log. This approach increases write performance dramatically by almost eliminating disk seeks. By writing everything sequentially to disk, a log-structured file system uses the disk in the most efficient way possible. The sequential nature of the log also permits much faster crash recovery. Rather than scan the entire disk to restore consistency, a log-structured file system need only examine the most recent portion of the log. Crash recovery time in a log-structured file system does not grow with the disk capacity [1].
Log-structured file systems are based on the assumption that files are cached in main memory and that increasing memory sizes will make caches more and more effective at satisfying read requests. As a result, disk traffic will become dominated by writes. This condition is precisely the opposite of the read-dominated workloads that almost all-previous disk storage managers were designed for. The changed workloads require a write-optimized rather than read-optimized disk access technique.
The notion of logging is not new; it has been used extensively in database systems and a number of recent file systems have incorporated a log as an auxiliary structure to speed up writes and crash recovery. However, these other systems use the log only for temporary storage; the permanent home for information is in a traditional random-access storage structure on disk. In contrast, a log-structured file system stores data permanently in the log; there is no other structure on disk. The log contains indexing information so that files can be read back with an efficiency comparable to that of current file systems. The result is that a log-structured file system can provide the read performance of current read-optimized storage managers while greatly exceeding their write performance.
For a log-structured file system to operate efficiently, it must ensure that there are always large extents of free space available for writing new data. Keeping large extents of space available is the most significant challenge in the design of a log-structured file system. This dissertation presents a solution based on large extents called segments. A segment cleaner process continually re-generates empty segments by compressing the live data from heavily fragmented segments. The segment cleaner acts as a garbage collector that insures that there is room to append more to the log. The file system can be viewed as a log that is continually wrapping upon itself.
The segment cleaner uses a simple but effective algorithm to segregate older, more slowly changing data from younger rapidly changing data. The algorithm is based on cost and benefit and causes the older data to be treated differently from younger data during cleaning. Simulations show that this algorithm reduces the overhead of the segment cleaner substantially, allowing high performance for disks operating at the capacity utilization of current storage systems.
The concept of log-structured
file systems is demonstrated by a prototype log-structured file
system called LFS, which is in production use as part of the
network operating system. Benchmark programs demonstrate that,
for small files, the raw writing speed of LFS is more than an
order of magnitude greater than that of the commonly used Unix
file system. Even for other workloads, such as those including
reads and large-file accesses, LFS is at least as fast as Unix in
all cases but one (files read sequentially after being written
randomly). Long-term measurements of LFS in production use show
the inherent advantages of log-structured file systems. Overall,
LFS permits about 65-75% of a disks raw bandwidth to be
used for writing new data (the rest is used for cleaning). For
comparison, Unix systems can only utilize 5-10% of a disks
raw bandwidth for writing new data; the rest of the time is spent
seeking.
Chapter 2
Techniques for
High Performance
Access time of disk storage is the main problem for improving performance. So a number of techniques have been developed to improve the performance of disk storage managers. This section discusses several of the techniques, focusing on those that benefit the Unix or related environment. The first techniques presented involve the use of main memory to hide from the application the much slower magnetic disks. Techniques presented include caching and pre-fetching of data.
The second form of performance
improvement techniques presented focuses on reducing the disk
delays by accessing the disk with more efficient access patterns.
Techniques of this form include optimizing the layout of data on
disk to promote locality of storage that is likely to be
requested together. Contiguous allocation and clustering are
examples of such disk layout techniques. Disk efficiency can also
be improved by ordering the requests sent to disk using the disk
scheduling techniques.
2.1 Read and Write Caching:
Access time of disk very much greater than main memory, so main memory is used for caching disks. The applications that are frequently needed are cached in main memory so the time spent on read same thing from slow disk is saved. The Unix file system makes extensive use of read caching. All disk resident items such as files, directories, and metadata blocks can be cached in the disk cache. The Unix disk cache is maintained using Least Recently Used algorithm that exploits locality and reuse in file system access patterns. Studies have found cache hits rates of 80-90%, this rate is high enough to significantly reduce the disk delays experienced by Unix applications.
From the storage manager's view there is no disadvantage to read caching. The more memory devoted to caching, the higher the cache hit rate and the fewer disk accesses that are needed. From the whole system perspective there are disadvantages to read caching. Memory used to cache files can not be used for supporting the applications other memory requirements. Care must be taken not to give so much memory to the storage manager that applications can not perform acceptably. The data, which might be referred in future, can be brought to memory in advance. This will improve performance greatly. This, Pre-fetching, works well in Unix system.
Write caching is also known as write buffering, write behind, or write delayed writes, returns control to the application before the write work is done (i.e. data transferred to the disk). The main memory is used to buffer the write requests. Second benefit of write caching occurs if multiple writes are done on same block, in this case changes in contents of the block is done in the buffer only and the final contents are transferred to the disk.
The sync system call occurs every 30 seconds in Unix system. This limits the data loss in case of power failure or a crash. Because of this, any changes that are older than 30 seconds will not be lost in a crash. This provision allows the file system to get some of the benefits of write caching without arbitrary long exposures to data loss.
To provide additional guarantees to applications, the Unix FFS implementation added a new system call, fsync, that allows an application to force all modifications to a file from the disk cache to disk. Once an fsync call returns, an application knows that all modifications to the file have been made it safely to disk. Applications such as text editors have been modified to use the fsync call to avoid loss of data during crashes.
2.2 Disk Optimization Techniques:
Once the main memory of the
computer has been fully utilized to improve performance, the next
area of focus is to minimize the time an application spends
waiting for the requests that have go to disk. Disk request times
can be reduced by reducing overheads such as time spent moving
the disk head (disk seeks) and time spent waiting for the storage
to rotate underneath the disk head (rotational delay). This
section describes the general approach used by storage managers
to reduce overheads in disk requests. The first approach is to
control the layout of storage on disk so that requests can be
processed with the least over-head. The second approach is to
schedule the applications disk requests to reduce the
overheads.
Contiguous allocation works by
allocating objects (i.e. files) to consecutive sectors on disk so
that sequential accesses can be performed with minimal delay
between blocks. When objects are read sequentially, the request
for the first block suffers a delay but all blocks after that
proceed at the best-case rate. Another Advantage of contiguous
allocation is that the size of an index used to find the blocks
of an object is small. The location of an entire object can be
described by a single disk pointer to the start sector of the
object. The index is much smaller than the Unix technique that
has at least one pointer per block. Smaller indexes mean that
less data needs to be stored and examined to locate blocks of an
object.
If more than one disk request is
outstanding at a given time then the storage manager can process
the requests in an order that reduces disk overheads. For example
if the disk is presented with 10 requests that alternate between
disk track 0 and disk track 100 the storage manager can reorder
the requests so that all the requests are processed on track 0
before moving to track 100. This would result in only two disk
seek delays, one to track 0 and one to track 100, rather than ten
seek delays that would occur if the requests were processed in
the order given. This scheduling of the disk, called disk
scheduling, can reduce the seek delays, number of seeks, and
rotational delays. It is found that sorting large numbers of
random requests can increase the disk efficiency from around 7%
to 25% of the disks maximum bandwidth. Achieving 25% of the
maximum disk bandwidth required a thousand requests to be queued
and running of a complex, CPU intensive sort function that
considered both seek delays and rotational latency.
Chapter 3
Log-structured
File System
3.1 Overview of Log-structured File System
The fundamental idea of a log-structured file system is to improve write performance by buffering in the file cache all modifications to the file system and writing all the changes to disk sequentially in a single disk transfer operation. The information written to disk in the write operation includes file data blocks, file attributes, index blocks, directories, and almost all the other information used to manage the file system. By using this format, a log-structured file system can obtain high write performance regardless of the request size or access pattern of the workload. The log-structured file system gets its name from the similarity of its disk layout to that of a log in the logging techniques. In both cases the disk storage manager writes modifications sequentially to disk. The chief difference between the systems is that the log in a log-structured file system is the only data structure on disk.
Having the log as the only data structure on disk allows the log-structured file system to avoid the second disk write to update the data structure that is needed in systems that use write-ahead logging. Elimination of this second write allows log-structured file systems to utilize nearly 100% of the raw disk bandwidth on all workloads including those for which standard logging techniques can achieve only 25% to 50% of the bandwidth.
Although the basic idea of a log-structured file system is simple, there are two key issues that must be resolved to achieve the potential benefits of the log-as-the-only-data-structure approach. The first issue is how to retrieve information from the log. For traditional logging systems, the log is never read during normal operations and only read sequentially during crash recovery. A log-structured file system would have unacceptable read performance if this same design were used. If a sequential search of the log were required to retrieve information, the system could suffer large delays from reads that miss in the file cache.
The second major issue of
log-structured file system design is how to manage the free space
on disk so that large extents of free space are always available
for writing new data. One way to view this problem is that the
log of the log-structured file system quickly fills the disk and
wraps upon itself. During steady-state operation the log of a
log-structured file system is continuously wrapping upon itself.
The key issue in the design of a log-structure file system is to
handle this log wrap problem as
efficiently as possible.
3.2 Data Location and Reading:
The log structure of the log-structured file system makes the allocation of data on disk trivial; data blocks are allocated and written when the log is written to disk. The challenge of file I/O in a log-structured file system is in the algorithms for retrieving data from disk. This section discusses the design techniques that allow log-structured file systems to quickly retrieve data from disk. Although the term log-structured might suggest that sequential scans are needed for retrieval operations in a log-structured file system, this is not the case and it would lead to unacceptable performance in most environments. For example, the disk technologies currently used in the environment have capacities around one gigabyte and maximum transfer rates of two megabytes per second. Using these disks a sequential scan over the entire disk would take over eight minutes. Even with a cache read hit rate of 99.9% (which is much higher than any current file system achieves) the average read time would increase by 480 milliseconds, a factor of ten slower than the current storage manager provides. Clearly, retrieval mechanisms which increases the file cache miss cost five orders-of-magnitude are unacceptable.
The alternative to sequential scan is to have the log-structured file system maintain an index within the log which can be used to speed retrieval for requests that miss in the file cache. The index allows the storage manager to locate and read data items written anywhere in the log without searching. Every data transfer to the log will modify this index so its update must be efficient. To achieve this efficiency, the modifications to the index must be written to the log along with all other modifications.
To implement the fast retrieval needs of Unix, the LFS prototype builds index structures similar to those used in Unix FFS. For each file, LFS allocates an inode data structure that, like the Unix inode, contains the files attributes (type, owner, permissions, etc.) and the disk addresses of the first ten blocks of the file. LFS also uses the same indirect blocks and double indirect blocks that as Unix for handling large files.
The main difference between the index structures of LFS and Unix FFS is that LFS inodes are not in fixed locations on disk. This differences is key to reducing the number of seeks of workloads containing small file modifications. When the inodes are in fixed locations updates to inodes, directory blocks, and the files data blocks cause non-sequential and small transfers to disks; precisely the type of access patterns that cause poor performance. Figure shows how the attributes, data, and index blocks of a file are allocated in the log in LFS and compares the disk layout of LFS with that of the Unix FFS.
Using the same indexing data structures as Unix FFS had several advantages for the LFS prototype. It was simple to integrate into the existing kernel and provided opportunities to reuse code from s original Unix FFS-like file system. Furthermore, keeping the indexing structure the same meant that behavioral differences between LFS and Unix FFS could not be attributed to the index mechanism. This was useful in the evaluation of the log-structured file system.
Figure 1: A comparison between LFS and Unix
FFS:
Creating two
single-block files named dir1/file1 and dir2/file2.
Each system must write new data blocks and inodes for file1
and file2, plus new data blocks and inodes for the
containing directories. Unix FFS requires ten non-sequential
writes for the new information (the inodes for the new files are
each written twice to ease recovery from crashes), while LFS
performs the operations in a single large write. The same number
of disk accesses will be required to read the files in the two
systems. LFS also writes out new inode map blocks to record the
new inode locations
3.3 Inode map:
Although accessing file data given the inode is the same in Unix FFS and LFS, accessing the inode is slightly more complex in LFS because of the code needed to locate the inode. In Unix each inode is at a fixed location on disk; given the i-number for a file, a simple calculation yields the disk address of the files inode. In contrast, LFS doesnt place inodes at fixed positions; they are written to the log.
To locate inodes quickly LFS uses a data structure called an inode map that maintains the current location of each inode. The inode map is structured like a large array indexed by the files i-number. Each entry in the array contains the current disk address of one inode. To fetch an inode for a file, LFS indexes into the inode map to compute the disk location of the inode. Once the address lookup is done, file inode and data block access are identical to Unix FFS. The same inode and indirect block structure means that once the lookup is done the same code is executed and the same number of disk blocks are accessed.
Since the LFS inode map replaces what was a simple address calculation in Unix FFS, the data structure implementing the inode map should be both reliable and fast. Should the inode map become lost or corrupted, all files will be unaccessible or only available after very time-consuming sequential scans. The inode map is queried and updated frequently meaning these operations need to be fast. Ideally the 31 inode map lookup should take no longer than the address calculation done in Unix FFS.
To achieve high performance, LFS divides the inode map into blocks that are written to the log much like file data blocks. Writing the inode map to the log allows modifications to the map to be written to disk using the normal high write performance of the log. No extra disk seeks are needed to update the inode map.
The inode map needs to be reliably recovered after system crashes and shutdowns. To support fast startup of a file system, the locations of the inode map blocks are kept in a fixed region on disk called the checkpoint region. At file system attach time, the checkpoint region is read to find the location of all inode map blocks.
With the inode map divided into blocks, LFS can support high performance lookups in the inode map by caching frequently accessed blocks of the inode map in the same memory area used to cache file blocks. The key to high performance lookups in the inode map is to have the hit rate on the inode map be high enough so the average cost is close to zero disk accesses per lookup. To achieve this high hit rate requires that the active portion of the inode map fit comfortable in memory.
The amount of memory needed to successfully cache the inode map depends on the size of the map and the locality of the lookups. Size of the inode map depends on a file system creation parameter specifying the maximum number of inode map entries. Each inode map entry is 12 bytes in size and contains the current address of the inode, a flags field, a version number, and the time of last access of the file.
Although the inode map entry only needs the inode address to locate the file, having version number and the last access time in the inode map rather than the inode means that these fields can be examined and updated without fetching or updating the inode. This is particularly important for the access time, which is updated on every read operation to the file.
By default, the maximum number of inode map entries is set so there can be one entry for each 4 kilobytes of space. This means that the maximum size of the inode map will be equal to 0.29% of the disk space. Currently the size of main memory of the file servers is around 5% of the disk space on the machine. Thus the entire inode maps for all LFS file systems fit comfortably in main memory on the file servers.
The preceding argument about the maximum size of the inode map is based on the overly pessimistic assumption that the file systems have an average file size of 4 kilobytes (The average size on the file system range from 9 kilobytes to 60 kilobytes.) File systems with a larger average file size will have fewer inode map entries. Inode map blocks that contain no allocated entries are not given space in memory or on disks. The inode map can be as small as a single block for file systems with few files.
To allow the inode map to require even less space in the file cache, the allocation of files to i-numbers is done so that files in the same directory are given i-numbers close to each other. The flags field of the inode map entry indicates if the i-number is allocated. Directories are allocated an inode map entry at random and files created in the directory start at the directorys inode map entry and scan forward until the first available entry is found. This allocation policy means that normally all the files in a directory will reside in the same inode map block. Locality of references to files in the same directory causes locality in the references to inode map blocks. The inode map blocks containing entries from the directories currently being referenced will be in memory. The blocks containing entries for directories not currently being referenced are not needed in memory.
The combination of locality of reference and a larger average file size on the production LFS systems means the hit rate for inode map reference is very high. Hit rates range from 99.1% to 99.9% on the production file systems even though on average only 15% of the maximum inode map size is resident in the file cache (i.e. only 3.6 megabytes of memory are used for caching inode map blocks of 8.1 gigabytes of disk space). Most misses occur because of cache cold-start effects when the file systems are first mounted after a reboot.
As technology scales up the size of the disk storage, the size of the inode map will increase but so will the size of the memories available to cache the blocks. Even with very large (1024-gigabyte) disk stores it seems unlikely that the inode map block cache miss rate will increase significantly.
Chapter 4
Free Space
Management in LFS
The most challenging design issue
for log-structured file systems is the management of free
space. The goal is to maintain large contiguous free regions
on disk so that the log can be written with large sequential
transfers. When the file system is created all the free space is
in a single extent on disk. By the time the log reaches the end
of the disk and wraps around on itself, the free space will have
been fragmented into many small extents corresponding to the
files or blocks hat were deleted or overwritten. From this point
on, the file system has two choices: threading or copying. Either
the log can be threaded around the still live blocks or the live
blocks can be copied out of the way. This section discusses the
two approaches in detail and presents the combined
threading-copying approach used in LFS.
4.1 Free Space Management using Threading:
The threading technique calls for the live blocks to be left where they are: new log blocks are written into the free spaces. Figure shows how a log-structured file system using threading might look. The disk is treated as a large circular list of blocks and the log is written to the next free block. The chief benefits of threading are its simplicity, the ability to operate at high disk capacity utilizations, and low free space management overheads when compared to the copying approaches. By scanning over the disk and writing to the free blocks a threaded log-structured file system can operate until the disk is 100% full. Keeping track of the allocated blocks is straightforward. A simple bitmap data structure like the one used by Unix FFS will work. The next block for writing the log can be located by scanning the bitmap in a circular fashion. The storage manager can continue operating as long as there is an unallocated block on the disk. This can be contrasted with some of the copying schemes under which it is prohibitively expensive to operate when the disk is nearly full.
The main problem with threading is that the free space on the disk will become fragmented so that large sequential transfers to the log are not possible. In a threading system, the large contiguous pieces of free disk space are overwritten with blocks of the log, which are later overwritten or deleted to deallocate the space. Unless all data is overwritten and deleted, the large pieces of disk space will be broken into smaller pieces by the blocks that survive. These surviving blocks must be skipped over when the log wraps. The time spent skipping over blocks can not be used for transferring data to the log.
The reduction in performance caused by skipping over the live blocks depends on the number of live blocks and how the live blocks are clustered on disk. The number and clustering of live blocks is in turn dependent on the storage capacity utilization and workload. Operating under a low storage capacity utilization means that the disk contains few live blocks to skip over when the log wraps. For example, a log-structured file system running with only 10% of the disk allocated to live blocks will on average write 9 out of every 10 blocks that the log wraps on. Using the assumption that the disk can skip over a block at least as fast as it can write the block, write performance of at least 90% of disk maximum bandwidth should be obtainable regardless of the workloads access pattern. This example can be generalized to produce an estimate of the maximum bandwidth as a function of the disk capacity utilization.
Although a threaded log-structured file system can operate until all disk blocks are in use, the estimate for disk write performance indicates that performance will degrade as storage capacity utilization increases. Consider a disk operating at 80% storage capacity utilization, a common figure in the environment. The log-structured file system would have to skip over 8 blocks for every 2 blocks written thus achieving only 20% of the disk maximum bandwidth. As the disk fills the write performance of a threaded log-structured file system drops.
Figure 2: Managing Free Space by Threading in log-structured file system:
This figure shows the
before and after disk image of log wrap on a threaded
log-structured file system. With threading the log skips over the
blocks that are still live and overwrites those blocks that have
been previously deleted or overwritten. Pointers are needed to
link together the disjoint pieces of the log.
The above estimate of write
performance may be overly pessimistic for systems operating at
high storage capacity utilizations. It is possible that the
workload overwrites or deleted files in such a way that the live
blocks are clustered with other live blocks and the dead blocks
are clustered with other dead blocks. This clustering would allow
the log-structured file system to write in large transfers for
high bandwidth and use fast seeks to skip over data.
Unfortunately such clustering is unlikely in practice. The
clustering of live blocks is controlled by what data is written
together to the log and the clustering of dead blocks is
controlled by the pattern of overwrites and deletions generated
by applications. Neither of these clusterings is under the
control the storage manager so the performance of threading
depends on the workload. The clustering needed for high
performance threading relies on application with perfect locality
in the write access patterns such that data that is written
together is always overwritten or deleted together. This type of
access allows the log to be threaded though the clusters of dead
blocks and skipped over for the clusters of live blocks.
Unfortunately, although the office/engineering environment has
significant locality, it is not perfect. Most but not all files
written at the same time will be deleted or overwritten together.
The result is that over time the few that live will cause the
free space to become heavily fragmented. The large contiguous
disk transfers needed for high performance will not be available.
4.2 Free Space Management using Copying:
An alternative to skipping over live data during log wrap is to move the live blocks somewhere else. The live data can be read in and written back to disk in another location, allowing the log to be written to the newly cleared location. Copying allows a log-structured file system to avoid the fragmentation created by threading. By copying data out of the way, the log can always be written in large contiguous pieces. Figure shows how a log-structured file system using copying might look.
The disadvantage of copying is that the copying overhead reduces the amount of the disk bandwidth available for application data. The time spent reading the live data in and writing it back out can not be used for reading or writing new data. Several factors control the amount of copying needed and the impact of this copying on storage manager performance. The factors include how the copying is implemented and the application workload. This relationship between workload, implementation, and performance is examined in detail in next chapter. One fundamental issue that must be addressed in a copying scheme is where to copy the data. Possible solutions include copying the live data back in a compacted form at the head of the log, copying into another log-structured file system to form a hierarchy of logs, or copying the data into some totally different file system or archive. Each approach has its advantages but they all share the common disadvantage that all long-lived data is written to disk multiple times. Valuable disk bandwidth for writing new data is lost to these multiple writes.
Copying the data back into the log is the simplest approach; only one data structure, the log, is needed. Under this approach the log wrap is dealt with by reading in the live data being wrapped on, compressing it into an unfragmented piece, and writing it back into the log. The mechanism to implement this form of copying is simple but it suffers the disadvantage that long-lived data is continuously copied every time the log wraps upon it.
The other two approaches attempt to reduce the amount of copying by removing the data from the main log when it is wrapped upon. By copying the data into another log, a hierarchy of log-structured file systems can be formed with each log containing data of similar age. The hope is that the older the data in the log the slower it will be changing and the less copying will be needed. The hierarchy of logs approach lessens the problems of the single log approach because long lived data is not copied on every wrap of the main log. It still performs multiple copies of long-lived data as the data moves through the layers of the hierarchy.
An alternative to the hierarchy of logs approach is to copy the data into another non log- structured file system. Any files that live long enough to survive the first log wrap would be copied into a separate storage area. This second file system could use a format that works well for slowly changing read-mostly workloads. The hope is that the first log-structured file system would slow the rate of modifications to the point that the read-optimized file system could absorb the changes. Copying into a separate file system limits the number of copies of long lived data to two copies.
Figure 3: Managing Free Space by Copying in log-structured file system:
This figure shows the
before and after disk image of log wrap on a copying
log-structured file system. Blocks that are still alive with the
log wraps are read, compacted, and written back to the log.
The disadvantage of Copying approach
is that it suffers the same multiple copy problem of the
hierarchy of logs. Every long-lived byte must be copied at least
once. An additional complexity is caused by maintaining two
different storage managers, a log-structured one and a
read-optimized one.
4.3 Combination of Threading and Copying:
Using a pure Threading approach, a log-structured file system loses performance to fragmentation, as large sequential transfers are not possible. Using a pure Copying approach, a log-structured file system can control the fragmentation but loses performance to the copying overhead. What is needed is a combination of these techniques that uses Copying to control the fragmentation and uses Threading to control the copying overhead.
The solution to the log wrap
problem used in the LFS prototype employs a combination of Threading
and Copying. LFS copies live data back into the log to
control fragmentation. When fragmentation is not a problem, a Threading
approach is used to quickly skip the unfragmented log sections.
The design has the advantages of both the Threading and Copying
approaches without the fragmentation overhead of Threading
or the copying overhead of Copying. So Threading is
used when disk utilization is low, and Copying is used
when disk utilization is high. Next section describes the
mechanisms used to implement this hybrid approach.
4.4 Segments (Mechanism to Manage Free Space):
To ensure that large contiguous chunks of disk space are available for log writes, LFS divides the disk into large fixed-size extents called segments. The log is written to segments sequentially from the segments beginning to its end, and all live data must be copied out of a segment before the segment can be rewritten. Segments also form the unit of the log threading. Once the log fills the current segment the log writing moves to another segment. Any segment may be chosen to be the next segment provided that it contains no live blocks.
To ensure a constant supply of segments available for writing the log, LFS uses copying to move the live blocks from heavily fragmented segments. This copying, called Segment Cleaning, generates empty segments by copying all the live blocks out of one or more of segments, combining the blocks together, and writing them to the log. The segment cleaning mechanism is described in detail in the next section.
Segments and the segment cleaning mechanism also allow slowly changing data to be collected together and skipped over by the log. Long-lived and unchanging data living in a segment can be ignored by the log writing mechanism. This allows LFS to achieve both the benefits of threading as well as the benefits of copying.
The LFS design calls for segments
to be large so that the transfer time to read or write a whole
segment is much greater than the cost of a seek to the beginning
of the segment. Thus whole-segment operations run at nearly the
maximum bandwidth of the disk regardless of the order in which
segments are accessed. If we increase segment size then
percentage of bandwidth achievable also increases as access delay
(seek time + rotational delay + controller overhead) becoming
negligible compared disk write-time. But large segments will
create problems for segment cleaning. So we have to select
segment size to precise.
4.5 Segment Cleaning:
Because LFS writes entire segments at once, only segments that are known to contain no live blocks can be written. LFS refers to such empty segments as clean segments and only permits the log to be threaded through clean segments. To ensure a constant supply of clean segments, LFS uses a mechanism called segment cleaning that generates clean segments from segments containing live blocks. The segment cleaning mechanism works by copying all the live blocks out of one or more segments, combining the blocks together, and writing them to clean segments as part of the log. Once the data has been written back to the log, the original segments are clean and can be used for new data or for additional cleaning. The live data that was previously fragmented is now compacted together in the segments written during the cleaning operation.
The mechanisms used to implement segment cleaning in LFS are simple and flexible. Segment cleaning is implemented as a three-step process: read a number of segments into memory, identify the live data, and write the live data back to a smaller number of clean segments. This section details the basic mechanism used to implement segment cleaning. The segment cleaning mechanism is controlled by several policies that guide decisions made during the segment cleaning process. Policies control events such as when cleaning is started and stopped, which
Figure 4: Segment cleaning in LFS:
In the top figure the
log is seen wrapping upon itself. Then segment cleaning occurs,
the fragmented segments are read, compacted, and written back to
other segments. After segment cleaning the log has room to grow
though the newly cleaned segments. Note that the log can skip
over non-fragmented segments.
segments are selected to be cleaned, how the live data of segments is combined together after being cleaned, and where the combined data is written.
The LFS implementation treats segment cleaning much as if the live blocks of the segment being cleaned were rewritten with the same data by an application program. Live blocks are brought into the file cache and written back to a segment in the log. Because of the similarity, segment cleaning in LFS is able to share much of the normal file system code.
The key difference between segment cleaning and normal file modifications is that during segment cleaning it must be possible to identify which blocks in a segment are live. For the live blocks it must also be possible to identify the file to which each block belongs and the position of the block within the file. This extra information is needed to update the files inode or indirect block to contain the new location of the block when it is written out. Although most file systems maintain an index mapping a files block to its location on disk, few support the reverse map from a disk block to its inode. When this reverse mapping is done such as during the disk scavenger program of the Unix file system, it is a very time-consuming process requiring scanning the inodes and indirect blocks of the file system. Using this mechanism would be unacceptably show for LFS.
Live block identification during segment cleaning is implemented in LFS by making the contents of a segment self-identifying. When a segment is written, enough additional information is added so that the segment cleaning mechanism can identify each block of a segment. For example, the additional information for a file block contains the files i-number and block number within the file. When the block is moved during segment cleaning this additional information is used to find and update the files inode to reflect the blocks new location.
Making segments self-identifying allows the segment cleaner to know what a block is or what it was but does not tell if it has been overwritten or deleted. The cleaner still needs a test to determine if a block is still live. LFS does this by checking the inode or indirect block to see if the block in question is still part of the file. If the block is pointed to by the files block index it is still live; otherwise it can be ignored during segment cleaning.
One advantage of using this lookup mechanism is that it eliminates data structures such as bitmaps or freelists that are commonly used in file systems to track blocks in use. Having no bitmap or freelist benefits both the normal running of the file system and the crash recovery algorithm. During normal operation such data structures are constantly being modified, requiring CPU and disk I/Os. The crash recovery algorithm must be coded to restore the data structures to a consistent state before the file system can be used again.
However, elimination of the bitmap means that LFS does not know which blocks in a segment are live until it reads the segment. This precludes any optimizations to the cleaning mechanism that attempt to read only the live blocks of segments. With the bitmap, the cleaning mechanism could be optimized to read only the live blocks as indicated by the bitmap.
The value of this lost optimization depends on the clustering of live blocks in the segments cleaned. In order for the optimization to be a big win, the implementation must be able to read the live blocks significantly faster than reading the entire segment. For current disk technology and CPU technology, this only occurs when a segments live blocks are clustered together into a few contiguous groups of blocks that can be read in a few transfers. Between the high disk controller overhead and the CPU time necessary to start a disk I/O, this is over 5 milliseconds of overhead per I/O. This overhead combined with missing disk revolutions between transfers means that when the number of transfers is greater than one per track it will not be faster than reading the entire segment.
A second disadvantage of the test LFS uses to determine if a block is live is that fetching the inode or indirect block may involve additional disk I/Os. If the inode or indirect block that describes a block is not in the segment being cleaned and not in the file cache, the cleaning mechanism must fetch it from disk to do the liveness test. In the worst case the lookup could cause several additional disk I/Os for each block examined. To keep the cost of cleaning low it is important that such random disk references be avoided.
In LFS, extra disk references will occur during the test only when the blocks index is neither in the file cache nor in the segments being cleaned. For workloads with small files and those with large files written sequentially the index is likely to be in one or both of these places. For small files, the file data and inode data are likely to be modified and written together into the same segment. Reading the entire segment reads both the files blocks and inode so no additional reads are required to check the block.
For workloads with large files written sequentially, the segments will contain many consecutive blocks of the same file. Checking the blocks will require few random disk I/Os because many of the blocks will be mapped by the same indirect blocks and inode. The random I/O request to fetch an indirect block will be amortized over many lookup requests using the same indirect block.
The workloads under which the lookup test causes the most random I/Os are those in which the index blocks and inode are separated from the blocks they map. This happens when large files are written non-sequentially and with little locality of reference. The data blocks mapped by a single indirect block will be scattered over the disk causing random I/Os to fetch the index when cleaning.
Using knowledge of the workload in the environment, LFS further improves cleaning performance by maintaining a file version number that is updated every time a file is deleted or truncated to length zero. The version number allows LFS to determine quickly and without additional I/Os whether a block is part of a file that has been totally overwritten or deleted. The version number for a file is kept in the files inode map entry (see Section 4.2.1) and included in the summary information for each of the files blocks in the segment. During the liveness test a quick comparison of the version number in the summary information and the inode map can determine if the block has been deleted.
In the workload environment of
the LFS prototype, large-file random I/O without locality is rare
so the lookup test performs well. With the truncate version
number eliminating most of the inode lookups for deleted files,
the remaining fetches of inodes require little disk I/O.
Excluding the swap1 file system, only 1% to 5% of the inode
fetches made during segment cleaning required any disk I/O. The
needed inodes were either found in the file cache or in the
segment being cleaned. The swap1 file system with its
non-sequential access patterns to large files had a much higher
number of disk I/Os during segment cleaning. Around 18% of the
inode fetches missed in the file cache and required a disk I/O to
fetch the inode block. The experience with swap1 suggests that an
additional mechanism might be needed to reduce the extra disk
I/Os during segment cleaning on file systems with workloads
containing large files accessed non-sequentially.
4.6 Segment selection:
Although the cleaning mechanism described above will work for any segment, some segments are better than others for efficient operation of the system. Segments that are heavily fragmented make a better choice than those that contain little free space. To guide in the selection of segments to clean and to keep track of the segments that are clean, LFS maintains a data structure called the segment usage table.
For each segment on disk the segment usage table contains an entry that describes the state of the segment, the number of live bytes in the segment, and an estimate of the age of the youngest block in the segment. The state bits tell if the segment is clean and therefore available for writing new data. The live byte count and the age are used by the policy for selecting segments to clean. Its usage is described in next chapter.
The segment usage table is modified by any operation that adds or deletes data from disk. The entry for a segment is updated when the log is written to the segment and when files are truncated or deleted. Entries are also updated when a block of a file or directory is rewritten causing the old block to become free.
Like the inode map, the segment
usage table is broken into blocks that can be cached in the file
cache. The size of the segment usage table depends on the segment
size and the size of the disk. With the 512-kilobyte segment size
of the current file system, the segment usage table occupies 48
kilobytes per gigabyte of disk space.
4.7 Segment layout:
This section presents the part of the LFS design that controls the organization of data on disks. The disk layout is important because it affects not only the read performance of LFS but also the segment cleaning overhead, crash recovery time, and space efficiency. This section describes layout of data in segments and the mapping of segments onto the disk media.
Because both write performance and cleaning overhead depend on efficient whole-segment transfers, segments need to be mapped onto the disk so that whole segment disk transfers are efficient. Segment size and alignment should be adjusted so that segments correspond to natural disk transfer units such as sets of tracks or cylinders. Such placement allows high bandwidth transfers that are not slowed by extra seeks or rotational delays. For disk arrays, the alignment and size should take the redundancy overhead into account so that segment write operations avoid the costly read/modify/write cycles needed for non-stripe-aligned accesses.
Once the segments have been sized and aligned for efficient transfers to and from disk, the LFS storage manager can focus on the placement of data inside segments. Issues that influence the layout of data in a segment include the read performance, data identification, and end-of-log detection.
Read performance:
The whole-segment transfer mode used by log writes in LFS means that the write transfer performance does not depend on how the data is organized in a segment. The important performance metric driving the segment layout is the performance of read requests. Data of a segment is optimized to exploit common read access patterns to increase performance. In the environment the most common access pattern is sequential whole-file access.
Data identification:
The segment must contain the additional information required for self-identification. This information should be placed in the segment so that the overhead in terms of log bandwidth and storage efficiency is low and the speed of identification during cleaning is fast.
End-of-log detection:
A basic problem in a logging
system is how the crash recovery algorithm detects when it has
reached the end of the log. The mechanism used must work
regardless of when the system dies while writing the log. For LFS
this means it must be able to detect when a log write to a
segment only partially completed because the system crashed.
4.8 Log regions and segment summary blocks:
The identification of a segments contents and the end-of-log detection algorithm are complicated by the need to write partial segments as well as whole ones. Partial-segment writes are needed when modifications to the file system are insufficient to fill an entire segment yet still must be written to provide reliability guarantees (e.g. a client has invoked the fsync kernel call). It must be possible to fill in a segment piece-wise with each piece self-contained.
LFS addresses this need by dividing segments into contiguous extents called log regions. A log region contains zero or more data or metadata blocks combined with a segment summary block that contains the additional information describing the contents of the region. The log region is the smallest unit in which blocks are written to the log. It can range in size from one block to an entire segment. Figure shows the layout of a log region containing several files.
Besides containing the self-identifying information for a segment, the segment summary block also contains pointers used to link log regions together to form a logically sequential log. The links are used to connect log regions in different segments as well as those within a single segment. Each segment summary block starts with a header containing the disk address of the previous segment summary block and the address of where the next segment summary block will be written. A program can follow these pointers to traverse the log in both the forward and backward directions.
Using the summary block approach
for data identification requires that two different blocks, the
summary block and the data block, be read to identify the data
block. The alternative approach would be to make every block
self-identifying so the summary block would not be needed. This
approach was rejected because it would require stealing bytes
from file data blocks to provide the identification information.
Since most application make file system block size requests,
stealing as little as one bit from a block will cause application
request to access multiple disk blocks.
4.9 Log region layout:
With the segment summary blocks taking care of the identification and end-of-log detection, the remaining issue for the log layout on disk is the arrangement of the data for fast retrieval and high space efficiency. During the log writing process, LFS assembles an image of the log region in memory. It is not until this time that the disk addresses for the blocks in the log region are known. The layout code must place blocks into the region and update the indexes pointing at them. The rest of this section details the log region layout algorithms.
Once initiated, the log region layout algorithm runs as a separate kernel-resident process for each file system that needs to write data. The file cache code hands the layout algorithm the files whose attributes, data blocks, or block indexes have changed. The layout algorithm retrieves the modified blocks and attributes from the cache and builds an in-memory image of the log region organized as a list of pointers to blocks in the file cache. When all the modifications are placed in the log region or the log region fills it is transferred to the current segment being written.
To help read performance, the log region layout algorithm clusters together the modified blocks from the same file. This is implemented by placing into the log region all the modified blocks of a file before moving on to the next file. For files written sequentially, this algorithm will allocate the blocks of the file contiguously and in order in the log region. Sequential read requests to the file can then be processed sequentially from disk.
For each block placed in a log region, the algorithm must update two data structures: the segment summary block and the index of the block. The addition of a block in a log region is recorded in an entry added to the summary block of the log region. This entry contains the i-number of the file, the block number being added, and the truncate version of the file from the inode map. Placing the block in the log region also determines the disk address at which the block will be written. The inode or indirect block is updated in the file cache to reflect this new address for the block.
When placing a file in a log region, the data blocks are placed first followed by the indirect blocks followed by any double indirect blocks and finally the inode. Placing blocks in this bottom-up hierarchical fashion means that blocks are never modified after being placed in a log region. Furthermore, the layout algorithm is the same for indirect blocks as it is for data blocks. Placing an indirect block causes the next higher level of index to be updated in addition to adding an entry to the segment summary block. In the entry, negative block numbers are used to distinguish the index blocks from the files data blocks. These block numbers are assign using a depth first scan of the tree of indirect blocks (i.e. the single indirect blocks is -1, the double indirect block is -2, the children of the double indirect block are -3 through -1026, etc).
The final step of a files layout is to place the inode in the log region and update the files inode map to point at this new inode. This step is both important for crash recovery and problematic for the layout code. Until the inode is written and the inode map updated, the blocks written to the log region will not be visible on disk because it is not part of the files index rooted at the inode. The crash recovery algorithm described in next chapter uses this property so that the inode acts like a data base commit record that decides when to switch the on-disk version of a file to include the newly written changes.
The problem with placing the inode is caused by its small size. The inode in LFS is only 128 bytes so allocating an entire 4-kilobyte block or even a 512-byte sector will waste much disk space due to internal fragmentation. To reduce the fragmentation, LFS clusters the inodes of a log region into a block called an inode block. The size of the inode block is a file system settable parameter that defaults to 4 kilobytes.
Once all the modified blocks of the file have been placed in a log region, the layout algorithm places the inode of a file into the inode block and updates the inode map entry of the file to point at the inode block. Inode blocks are allocated in the log region whenever an inode needs to be placed and no space exists in the current inode block. For log regions with many small files, the files inodes will be placed into a few inode blocks with low fragmentation. If need be, all blocks of a log region can be allocated to data blocks with no inode blocks.
On the prototype LFS file systems with four-kilobyte inode blocks, on average only half the entries in the inode blocks were in use. The distribution indicated that having a smaller inode block would not have helped much because the blocks were either nearly full or contained only a few live inodes. One of the reasons for this was that the crash recovery mechanism of the distributed file system modifies the inodes of the files recently used by a client machine when the client crashes. This caused more inode modifications than LFS anticipated. In hindsight, the crash recovery version number should have been put in the inode map like the files access time.
Clustering the inodes together into an inode block has benefits beyond reducing fragmentation. By fetching the entire inode block when a reference to an inode in the block occurs, LFS implicitly pre-fetches the other inodes in the block. Any locality of reference to inodes in the same inode block is exploited. The inode blocks are cached like file blocks in the file cache. Since the higher levels of the file system cache individual inodes, the inode blocks act as a second-level cache.
On the production LFS file systems, most access to inodes find the inode block already in the cache. Except for the swap1 file system, the cache-hit rates on the inode blocks are in the range 90% to 96%. The hit rate on the swap1 file system was only 75% for two reasons. The first reason was that swap1 had relatively few files so the caching of inodes in the higher levels of the file system was very effective. Only the cold start misses referenced the inode blocks in the file cache. The second reason was that swap1 also suffered worse fragmentation in the inode blocks. The average number of inodes was only four per block and most (60%) inode blocks contain less than three inodes. The heavy fragmentation of the inode blocks meant that the caching was ineffective for swap1.
To reduce fragmentation and
improve space utilization, the LFS log region layout algorithm
allows the last block of a file to occupy less than a full
4-kilobyte block. The number of sectors allocated to the last
block of a file is rounded to an integral number of 512-byte disk
sectors. Like the Berkeley Unix FFS, allowing the last block to
be a fragment greatly reduces the intra-file fragmentation in
file systems with many small files that are not integral numbers
of blocks. It is also possible to round fragments to even smaller
units than disk sectors and achieve greater fragmentation
reduction. This was not done in the current LFS implementation
because addressing the disk on more than sector boundaries would
require increasing the size of disk pointers.
Chapter 5
Enhancing
Performance of LFS
5.1 Enhancing Performance:
As we know improving Write
performance of LFS was designed to provide high performance for
write through large batched disk transfers. However, additional
research demonstrated that cleaning overhead could result in
dramatically lower write performance for some workloads.
Self-tuning principles can be applied to LFS to provide high
write performance across a broader range of workloads.
The cleaner reads an entire
segment even if a few live blocks remain. It may be beneficial to
allow the cleaner to read only if the live blocks when that would
take less time, but following the original LFS, we did not
include this optimization. Ideally, the data would be written
once and never moved bt the cleaner; this happens if all data in
the segment is overwritten before the segment is reclaimed. In
the best case, then, the write cost can be 1; i.e. the entire
write was used in the new data.
Here we define the transfer
in-efficiency as
Transfer in-efficiency =
Segment Transfer Time actual / Segment Transfer Time Ideal
Access Time + Segment Size / Disk Bandwidth
Transfer in-efficiency = ----------------------------------------------------------------
Segment Size / Disk Bandwidth
Disk Bandwidth
Transfer in-efficiency = Access Time * ----------------------------------------------- + 1.
Segment Size
So the overall cost will be ...
Transfer Time total
= ---------------------------------------------------
Transfer Time ideal
5.2 Effect of Segment size
The segment size plays an important role in the write performance of LFS than has been previously suggested. It was suggested that the segment size should be large enough to absorb the penalty of the seek to the segment. In Sprite LFS a relatively large 1-MB segment is used. On the other hand, there is a countervailing benefit to choosing a smaller segment size. It was observed that at smaller segments the variance in the segment utilization is larger; allowing the cleaner to chose less utilized segments. In particular, smaller segments can simply be declared clean without requiring any disk transfers by the cleaner.
In the limit, with one-block segments, cleaning costs would always be zero because all segments would be either fill or empty and no data would need to be compacted. Of course, single block segments would eliminate any advantage from batched transfers.
According to the original definition write cost, write cost is minimized at small segments because smaller the segment reduces cleaner overhead. However, this does not reflect the inefficiency introduced by transferring smaller segments. Transfer in-efficiency is the ratio between the actual segment transfer time to the time it would have taken to transfer the segment at full disk bandwidth.
As segments become large, the access time becomes insignificant relative to the time for transferring the segment, and therefore the transfer inefficiency approaches one. The write cost measures the overhead of cleaning, The transfer inefficiency measures the bandwidth degradation caused by seek and rotational delay.
The overall cost equation captures both these effects. The overall write cost is the total time required to write new data and clean segments, divided by the time to write the new data at full disk bandwidth.
If we consider disk
characteristics, against segment size at different disk
bandwidths, the overall write cost will be greater than 3 when
segment size is less than 16 KB or greater than 4 MB. The write
cost is minimum, when segment size is in around 256 KB. So at
this segment size the disk performance will be maximum.
5.3 Cleaning policies
The final parameters for the simulations selected the segment cleaning policy. The simulator allowed for control of the following policies
(1) The size and number of segments in the file system. The segment size ranged in values from 24 kilobytes up to 8 megabytes. Unless otherwise specified a segment size of 2 megabytes (512 files) was used. The number of segments (size of the simulated disk) was adjusted so that there were always 200 megabytes (524288 files) of data.
(2) The number of live files read at a time during cleaning. The reading step of the cleaning mechanism continues until it has read this number of live files into memory to be combined and written out during the write-back step. The motivation behind the number is that LFS reserves space in the cache for cleaning. This number specifies the maximum amount of space reserved in the cache. If not specified then 10 megabytes (2560 files) were read during a cleaning operation.
(3) The policy used to select the segments to be cleaned. This policy is invoked when cleaning is started and is used to select which segments are to be read in and combined. Two policies were implemented: the greedy and cost-benefit policy. The greedy policy chooses the segments with the least number of active bytes according to the segment usage table. The cost-benefit policy uses a function of the age and amount of data in a segment. Unless specified the greedy policy is used.
(4) The policy specifying the order in which live data is written back to disk during cleaning. The two choices were to combine the live files in the order in which they were read from segments during cleaning or to group the files by last modify time.
Because of the large number of
possible policy and workload combinations it was infeasible to
study all combinations. The approach taken for this thesis was to
study the effects of varying each cleaning policy individually.
The goal was not to use the simulator to predict performance but
to understand the reason behind the changes in write cost due to
varying the policies. To further aid in understanding, I modified
the simulator to graphically represent the state of the disk. As
the simulator ran it produced an animated view of segments and
allowed the effects of policy changes to be visualized.
5.4 Cleaning Algorithms
The cleaning used normally
degrades the performance, so we have to use the cleaning
algorithm such that it will least affect the performance. Traditional
LFS cleaning is used when disk utilization is low. But at
higher disk utilizations, depending on usage pattern different
cleaning methods are used like Hole-plugging.
In traditional cleaning, the live blocks in several empty segments are combined to produce a new segment, freeing the old partially empty segment for reuse. In many environments traditional cleaning works well. Normally this cleaning is done when system is idle. For the workloads examined, 97 % of the cleaning is done in background. Observation shows that cleaning cost is relatively low when disk utilization is less than 80 %. If segments updates show a higher degree of locality, then some segments will be emptier than other and will yield more free space when cleaned.
The problem with cleaning appears
at higher disk utilizations, mainly for the workloads with many
random updates and insufficient idle time. Because segments do
not have a chance to empty before they must be cleaned. In order
to free one segment cleaner have to process many nearly full
segments. Each segment must be read, and all but the few holes
rewritten into a new segment. This translates high write cost. In
an extreme case, the entire disk might need to be cleaned in
order to make a new single free segment.
In hole-plugging,
partially empty segment are freed by writing their live blocks
into the holes found in other segments. In order to produce one
free segment worth of space, we need only read one segment and
rewrite each of its live blocks. These writes are more expensive
per block than writing complete segment because each block
requires additional seek and rotational delay. However, despite
the higher per block cost, at higher at disk utilization, hole-plugging
is still better than traditional cleaning because we avoid
processing so many segments.
Segment size =
4 blocks
New Free Segment
Live block of segment
Figure 6: Hole-plugging: Above figure shows some live block from one partially full segment are transferred to other partially full segment. This is very useful here as disk utilization is high.
But at lower
disk utilization larger cost of writing individual blocks makes hole-plugging
more expensive than traditional cleaning. So we have to use
different cleaning algorithm depending upon disk utilization.
Chapter 6
Summary and
Conclusion
This Seminar presents a new disk
storage management technique called a log-structured file system.
The basic principle is simple: collect large amounts of new data
in a file cache in main memory, then write the data to disk in a
single large I/O that can use all of the disks bandwidth.
Also presented was the
cost-benefit segment selection policy that enables the
log-structured file system to exploit locality to further improve
performance. Finally, it provides a software technology to help
magnetic disk storage track the improvements in CPU technology so
future systems can achieve higher performance.
Even if one of the other
approaches, such as file caches with battery backup, proves
better than log-structured file systems, we think that the nature
of I/O to disk is changing enough to justify new disk
organizations. Logging approaches offer easier recovery and
versioning and better locality than most of todays file
systems, so some of the logging techniques may be applicable as a
supplement to other approaches.
Bibliography
[1] Christian Czezatke. DTFS: A Log-structured File System for Linux, August 1998
[2] Mendel Roseblum. The design and implementation of a Log-structured File System. Masters thesis, University of California, Berkley, January 1992.
[3] John Ousterhout, Fred Douglis. Beating the I/O bottleneck: A case study of Log-structured File System. February 1994.
[4] Uresh Vahlia, "UNIX Internals", Prentice-Hall, New Jersey, 1996, Chapter 11, pp. 338-350.
[5] Jeana Neefe Matthews Drew Roselli Adam M. Castello Randolf Y. Wang and Thomas E. Anderson. Improving the performance of Log-structured File System with adaptive methods. In Sixteenth ACM Symposium on Operating System Principles. ACM, October 1997.
[6] Tage Stabell. Security and Log-structured File System. ACM Operating System review, Vol. 31, No. 2, pp. 9-10, January 1997.
[7] John Ousterhout and Mendel
Rosenblum. Design and implementation of Log-structured File
System. July 1991.
Web-Sites:
[1] http://www.complang.tuwien.ac.at/czezatke/
[2] http://collective.cpoint.net/lfs/
[3] http://metalab.unc.edu/linux-bin/
Mail your
Suggestions, Comments, Doubts, Questions to author@abhishikt.8k.com
$Id: LFS.htm,v 1.4 2000/02/29 18:31:46 root Exp root $