[ale] ext2fs and non-fragmentation

Thu May 30 15:58:01 EDT 2002

Chris -

I really appreciate the explanation; it's the first cogent one I've
seen, even though I've seen the no-need-to-defrag statement made in
numerous places over the years.

To answer the "How does Linux avoid fragmenting drives?" question, the
best short answer I can think of is that the file systems most commonly
used with Linux (phrasing it that way implies that many others CAN be
used) have a longer evolutionary history than NTFS, so mechanisms that
keep heavy fragmentation from occurring in the first place have been in
the kernel for some time.  You could also add that the design is based
on the Berkeley FFS in concept (dropping "Berkeley" into any statement
tends to placate all but the most vapid PHBs).

A question, though:  is there anything in Linux that monitors FS
structure and pushes things around for performance's sake independently
of actual process writes?

- Jeff

On Thu, 2002-05-30 at 10:34, Chris Ricker wrote:
> On 30 May 2002, Danny Cox wrote:
> 
> > Steve,
> > 
> > On Thu, 2002-05-30 at 07:03, sangell at nan.net wrote:
> > > Anyone know of a good site that explains in detail how ext2fs avoids
> > > fragmenting disks (or maybe you can explain it yourself)? I am trying to
> > > replace some more MS servers and was asked the question in a meeting
> > > yesterday, "How does Linux avoid fragmenting drives?" and, to be quite
> > > honest, I couldn't answer; to some, "It just doesn't!" is not a
> > > sufficient answer. I tried a few searches on Google, but the previous
> > > answer was about all I could find.
> > 
> > 	One main concept works on the idea of allocating inodes/data within
> > "cylinder groups", keeping the data and meta data together.  When
> > growing a file, if it can find the necessary room in the current
> > cylinder group, it uses that.  Only when it's full does it change to
> > another, which becomes the "current" cylinder group for the next
> > allocation.
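
The "stay in the current cylinder group until it fills" policy described
above can be modelled as a toy allocator in Python.  This is a minimal
sketch under stated assumptions, not code from BSD or Linux.  Three
simplifications are hypothetical: the class name GroupAllocator, the
forward wrap-around search order, and treating a group as a plain list
of free flags.

```python
class GroupAllocator:
    """Hypothetical toy allocator: stay in the current group until it is full.

    Not BSD FFS or ext2 code; each group is modelled as a list of free flags.
    """

    def __init__(self, n_groups, blocks_per_group):
        self.blocks_per_group = blocks_per_group
        # free[g][i] is True while block i of group g is unused
        self.free = [[True] * blocks_per_group for _ in range(n_groups)]
        self.current = 0  # the "current" group for the next allocation

    def alloc(self):
        """Return a global block number, preferring the current group."""
        n = len(self.free)
        # Try the current group first, then the following groups (wrapping
        # around); whichever group supplies the block becomes "current".
        for step in range(n):
            g = (self.current + step) % n
            for i, is_free in enumerate(self.free[g]):
                if is_free:
                    self.free[g][i] = False
                    self.current = g
                    return g * self.blocks_per_group + i
        raise OSError("no free blocks left")


alloc = GroupAllocator(n_groups=3, blocks_per_group=4)
print([alloc.alloc() for _ in range(6)])  # [0, 1, 2, 3, 4, 5]
print(alloc.current)                      # 1: group 0 filled, group 1 took over
```

Freeing a block in group 0 afterwards does not pull the next allocation
back there.  With `alloc.free[0][0] = True`, the next `alloc.alloc()`
still returns 6 from group 1, which is the locality the policy is after.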
> > 
> > 	It *does* eventually fragment badly, but only when the FS is about 95%
> > full (or some other magic percentage), and then it *really* slows down.
> > That's why 10% (or 5%) is reserved for root only: normal users can't use
> > that last little bit to really slow the system down.
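
The reserve boils down to a threshold check.  Here is a sketch in
Python, assuming a rule where ordinary users are refused once an
allocation would dip into the reserve, while root may keep going.  The
function names are hypothetical, and the real kernel check has more
conditions than this.  For ext2, mke2fs defaults to a 5% reserve, which
can be changed with `mke2fs -m` or `tune2fs -m`.

```python
def reserved_blocks(total_blocks, percent):
    """Blocks held back for root, e.g. 5% of the file system."""
    return total_blocks * percent // 100


def may_allocate(free_blocks, reserved, is_root):
    """Hypothetical check: may this caller take one more block?

    Root may use every free block; everyone else must leave the reserve alone.
    """
    if is_root:
        return free_blocks > 0
    return free_blocks > reserved


reserve = reserved_blocks(1_000_000, 5)
print(reserve)                               # 50000
print(may_allocate(50_001, reserve, False))  # True
print(may_allocate(50_000, reserve, False))  # False: only root's reserve is left
print(may_allocate(50_000, reserve, True))   # True
```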
> > 
> > 	So, the answer is: certainly files are fragmented, but usually within
> > one cylinder group, so next-block-lookup is still fast, and doesn't move
> > the head assembly too much.
> > 
> > 	As to where I saw this, it was long ago, in a collection of papers on
> > BSD.  The paper was entitled something like "Implementation of the (a)
> > Fast File System" or "The Berkeley Fast File System".  So, looking on the
> > various BSD sites may get you further.
> 
>  
> All this is true for the FFS / UFS file system, and documentation of it,
> like you say, is in Kirk McKusick's papers ("A Fast File System for Unix",
> etc.).
> 
> ext2 is conceptually similar, but the terminology's different.  Check out
> /usr/src/linux/fs/ext2/ialloc.c to see how fragmentation of directories and
> inodes is handled, and /usr/src/linux/fs/ext2/balloc.c to see how
> fragmentation of data blocks is handled.
> 
> The basic structure of ext2 is that the fs is divided into block groups
> (these are basically the same as McKusick's cylinder groups, with the
> difference primarily being that cylinder groups are based on real or, these
> days, imagined disk geometry, while block groups don't even pretend to
> correspond to the underlying physical structure).  Each block group contains
> a map of its blocks and a map of its inodes.  When a new normal
> (non-directory) inode is allocated, ext2 just grabs a free inode from the
> inode map for the block group of that new file's parent directory (ensuring
> that directories and their contents are co-localized on disk, so directory
> lookups will be quick).  When a new directory inode is allocated, ext2
> searches for the nearest block group which has both lots of free data blocks
> (so that the directory can grow in the future w/o fragmenting) and which has
> a low number of existing directories (giving each directory local room to
> grow, so that normal inode allocation can be done w/in the same group).
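
Chris's placement rules can be sketched in Python as follows.  This is a
simplified model, not the heuristic in ialloc.c.  Three choices are
assumptions made for illustration: the BlockGroup fields, reading "lots
of free data blocks" as "at least the average free block count", and
measuring "nearest" as the difference in group numbers.

```python
from dataclasses import dataclass


@dataclass
class BlockGroup:
    # Hypothetical per-group summary, loosely like a group descriptor.
    free_blocks: int
    free_inodes: int
    dirs: int  # directories already placed in this group


def pick_group(groups, parent, is_dir):
    """Pick a block group for a new inode whose parent directory is in `parent`."""
    candidates = [g for g, bg in enumerate(groups) if bg.free_inodes > 0]
    if not candidates:
        raise OSError("no free inodes")
    if not is_dir:
        # Regular file: the parent's group if it has a free inode,
        # otherwise the nearest group that does.
        return min(candidates, key=lambda g: abs(g - parent))
    # Directory: keep groups with at least the average free blocks (compared
    # without division), then prefer few directories, then closeness.
    total_free = sum(groups[g].free_blocks for g in candidates)
    roomy = [g for g in candidates
             if groups[g].free_blocks * len(candidates) >= total_free]
    return min(roomy, key=lambda g: (groups[g].dirs, abs(g - parent)))


groups = [BlockGroup(100, 10, 5), BlockGroup(10, 10, 0),
          BlockGroup(90, 10, 1), BlockGroup(95, 10, 1)]
print(pick_group(groups, parent=0, is_dir=False))  # 0: file stays with its directory
print(pick_group(groups, parent=0, is_dir=True))   # 2
```

In the example, group 1 has no directories at all, but it is skipped for
the new directory because it has too little free space left to grow into.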
> 
> When allocating data blocks, ext2 behaves similarly.  If it's growing an
> existing file, it looks for adjacent blocks (which were pre-allocated when
> the file was created; see next sentence).  If it's a new file, it looks for
> a large contiguous group of free blocks w/in the file's inode's block group,
> and then creates the file there, allocating the needed blocks and
> pre-allocating the adjacent blocks so the file can later grow locally.
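
Here is a rough Python sketch of that allocation order.  If the file
has a block reserved earlier, it uses that.  Otherwise it looks for a
free run near the file's last block (or the start of its inode's group),
takes one block, and reserves the ones after it.  The window size
PREALLOC, the File class, and the exact search order are assumptions for
illustration.  balloc.c is the authority on what ext2 actually does.

```python
PREALLOC = 7  # hypothetical: extra blocks reserved after each fresh allocation


def find_free_run(used, start, end, length):
    """First block of `length` consecutive free blocks in [start, end), or None."""
    run = 0
    for b in range(start, min(end, len(used))):
        run = 0 if used[b] else run + 1
        if run == length:
            return b - length + 1
    return None


class File:
    def __init__(self, group_start, group_end):
        self.group = (group_start, group_end)  # block range of the inode's group
        self.blocks = []    # data blocks, in file order
        self.prealloc = []  # reserved (marked used) but not yet holding data


def add_block(used, f):
    """Give file `f` one more data block and return its number."""
    if f.prealloc:
        # Growing the file: take the adjacent block reserved last time.
        b = f.prealloc.pop(0)
    else:
        lo, hi = f.group
        goal = f.blocks[-1] + 1 if f.blocks else lo
        # Search order: a run right after the file's data, a run anywhere in
        # the group, any single block in the group, any block on the disk.
        searches = [(goal, hi, 1 + PREALLOC), (lo, hi, 1 + PREALLOC),
                    (lo, hi, 1), (0, len(used), 1)]
        for start, end, length in searches:
            b = find_free_run(used, start, end, length)
            if b is not None:
                break
        else:
            raise OSError("no free blocks")
        # Reserve the rest of the run (empty when only one block was found).
        f.prealloc = list(range(b + 1, b + length))
        for p in f.prealloc:
            used[p] = True
    used[b] = True
    f.blocks.append(b)
    return b


used = [False] * 32
f1, f2 = File(0, 16), File(0, 16)
for _ in range(4):  # write to the two files in alternation
    add_block(used, f1)
    add_block(used, f2)
print(f1.blocks, f2.blocks)  # [0, 1, 2, 3] [8, 9, 10, 11]
```

With PREALLOC set to 0, the same alternating writes interleave as 0, 2,
4, 6 and 1, 3, 5, 7.  That interleaving is the fragmentation the
reservation avoids.  This sketch never returns unused reserved blocks
to the free pool.  A real allocator has to give them back at some
point, for instance when the file is closed.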
> 
> This doesn't give you 100% non-fragmented file systems, and as Danny
> mentioned, the fragmentation does increase as the file system fills, since
> ext2 can no longer cluster inode and block allocations so that files don't
> fragment (contrary to popular opinion, the 5% reserved for root is for
> performance reasons, not for security reasons).  In practice, though, it's
> Good Enough.  There are ext2 defragmentation tools kicking around, but 
> no one uses them because the problem's never that bad.
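
One way to put a number on "fragmentation" is to count how many
separate runs of consecutive blocks each file occupies.  e2fsck's
summary line reports something similar as the percentage of
non-contiguous files.  Here is a minimal Python sketch, assuming each
file is given as its list of block numbers in file order.  The function
names are hypothetical.

```python
def count_fragments(blocks):
    """Number of runs of consecutive blocks; 1 means the file is contiguous."""
    if not blocks:
        return 0
    breaks = sum(1 for prev, cur in zip(blocks, blocks[1:]) if cur != prev + 1)
    return 1 + breaks


def noncontiguous_percent(files):
    """Percentage of files that occupy more than one run of blocks."""
    if not files:
        return 0.0
    fragmented = sum(1 for blocks in files if count_fragments(blocks) > 1)
    return 100.0 * fragmented / len(files)


print(count_fragments([10, 11, 12, 13]))      # 1
print(count_fragments([10, 11, 40, 41, 90]))  # 3
print(noncontiguous_percent([[0, 1], [4, 6], [7], [8, 9, 10, 11]]))  # 25.0
```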
> 
> > 	If you can get him to respond, contact Ted Tso (see the MAINTAINERS
> > file in /usr/src/linux), and he may point you to some useful
> > information.  Then again, he may not. ;-)
> 
> Ted Ts'o also maintains a web page about ext2.  I don't have the URL handy, 
> but I'm sure Google does ;-).  I think it had a couple of white papers about 
> ext2 on it....
> 
> later,
> chris
> 
> 
> ---
> This message has been sent through the ALE general discussion list.
> See http://www.ale.org/mailing-lists.shtml for more info. Problems should be 
> sent to listmaster at ale dot org.
> 
