[ale] what exactly does a long smart hdd test do?

mike at trausch.us mike at trausch.us
Sat May 12 17:55:29 EDT 2012


On 05/11/2012 09:09 PM, Ron Frazier (ALE) wrote:
> Hi Mike,
> 
> I don't think that discredits the project, I think it's a wise design.
> Here's why. The most recent version of SpinRite came out in 2004 and has
> a history going back to 1988. The program designer is planning an
> update, but it's not out yet. Nevertheless, that version will work on
> modern drives. I don't know how deeply SpinRite interacts with the bios.

If it has the limitations of BIOS, then it is using the BIOS functions
as published in Ralf Brown's Interrupt List (RBIL), which was (and for
the real-mode programmer, still is) the definitive source of BIOS
interrupt interfaces.  The BIOS supports only a limited set of
functionality, which has been extended over the years to cater to larger
drives and so forth.  It supports only limited error handling (for
example, INT 0x13/00 is "RESET DISK SYSTEM", but that only seeks the
drives to track 0, it doesn't actually perform a bus reset).

The INT 13 interface is horrible and has been unsuitable for serious
software for many years, which is why modern operating systems no longer
use it.  Even Windows 3.11 had a dedicated driver for bypassing the BIOS
routines for disk access (it was called "32-Bit Disk Access" or something
along those lines, IIRC).
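As an illustration of how awkward the classic interface is, here is a
sketch (in Python, purely to show the bit-packing; the register layout
follows the commonly documented INT 13h convention from RBIL) of how a
C/H/S address is crammed into the CH/CL/DH registers for the AH=02h
"read sectors" call — the 10-bit cylinder number is split across two
registers:

```python
def pack_chs(cylinder, head, sector):
    """Pack a C/H/S address into the CH/CL/DH register values used by
    the classic INT 13h AH=02h "read sectors" call.

    The cylinder is 10 bits, split awkwardly: low 8 bits in CH, high 2
    bits in the top of CL.  The sector number is 1-based and only 6
    bits wide, which is where the 63-sectors-per-track limit comes from.
    """
    assert 0 <= cylinder < 1024
    assert 0 <= head < 256
    assert 1 <= sector <= 63
    ch = cylinder & 0xFF                            # low 8 bits of cylinder
    cl = (sector & 0x3F) | ((cylinder >> 8) << 6)   # sector + cylinder high bits
    dh = head
    return ch, cl, dh

# Cylinder 515, head 4, sector 9:
print(pack_chs(515, 4, 9))  # (3, 137, 4)
```

Note that the hard limits baked into that encoding (1024 cylinders, 63
sectors) are exactly the kind of thing modern operating systems route
around by not using the BIOS at all.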

> I do know it boots from its own copy of freedos and runs from there. The
> product was designed to work with any PC compatible computer, be it PC,
> Linux box, or Mac. It can even work with Tivo or iPod drives, etc. if
> you take the drive out and attach it to a PC. It needs to be able to

Well, yes.  They all speak the same language, regardless of the type of
computer they are plugged into.  The only thing that is really different
between disks in a Mac and a PC is the convention used to store data
(e.g., partition tables or maps and filesystems).

> have total control over the drive, including disabling some of the
> drive's normal error correction, so it can do analysis and detect
> problems. It can't have the OS in the way and interfering with its
> operation. The primary target machine it was designed to run on was no

... but it's perfectly happy having the limited, inconsistently
designed BIOS implementations get in the way and do its work for it?

I'll grant that the Linux generic SCSI interface did not exist when
SpinRite was first created.  I'll grant even further that it was
impractical in that time period to actually write dedicated drivers for
the various disk controllers which existed at the time.  The reason that
BIOS existed was to simplify the creation of relatively simple systems
so that they did not need to know anything more than the generic BIOS
interface.

However, with that generic BIOS interface comes a
lowest-common-denominator approach to handling the disk controller and
therefore the disks themselves.  This would also be the primary reason
why it's a horrible idea to use the BIOS interface.

> doubt Windows machines. Back in 2004, those machines were running
> various combinations of Win 95, Win98, Win ME, Win 2000, Win NT, and Win
> XP. I don't think any of the Windows systems allow the kind of
> unfettered access to the drive that SpinRite needs. Also, as far as I

Windows systems not built on the Windows NT kernel (e.g., Win9x and
earlier) do allow direct hardware access because at their core they were
still 16-bit operating systems.  Calls to the Win32 API were largely
thunked to 16-bit modules that were preexisting.  There were large
components of the system that ran in 32-bit mode, but a lot of it did
not.  There was therefore a heavy cost associated with the constant
changing of the CPU mode as part of context switches and so forth, which
made Win9x both clunky and relatively unstable.

Starting with Windows NT, direct hardware access is prohibited for most
applications.  The exceptions are those that install kernel drivers
enabling them to bypass such restrictions.  I don't know if NT has
something like the Linux generic SCSI interface, but if it doesn't and
an application needed one, it would certainly be possible to implement.

Building a utility directly on top of a SCSI interface would be far
superior to building it on BIOS.  Talking directly to the hardware is
the only way to get closer to the metal than a SCSI interface allows,
but that's unnecessary for most applications.  If it were necessary,
the preferred approach would be a 32-bit extended DOS program that
performs direct hardware access.  A real-mode program would also work,
but there's little point to writing real-mode code when 32-bit DOS
compilers are easy to come by (GCC, for example, targets MS-DOS via
DJGPP).

> know, there isn't a way to dismount an internal drive in Windows and
> work on it, as you potentially can in Linux. Even if you could dismount
> a drive, the system needs to run on the system drive, and the average
> user doesn't have any way to boot a Windows machine without doing so
> from the system drive. So, the best design choice was to make a product

You can absolutely unmount drives in Windows.  Use the Disk Management
snap-in in the Microsoft Management Console for a GUI way to do it.
There are of course API calls that can be used by custom software to do
it as well.

You are right in that you cannot unmount the system drive, but that
problem is common to all operating systems, not just Windows.  DOS had a
sort of exception, but DOS was also small enough to remain completely
resident in RAM when there was more than 1 MB of memory present and usable.

> that booted itself. That way, the OS isn't running, all the drives are
> dismounted, and he didn't have to wonder whether the user would be able
> to boot their pc so they could use his software. It was probably the
> best solution to the problems he had to deal with. On my machines where
> the bios is new enough to match the hard drive capacity, I can run

Most, if not all, modern BIOS firmware provides what are known as the
INT 13 extensions, which replace the old C/H/S calls with a packet-based
LBA interface.  The maximum disk size the BIOS could address grew in
steps: roughly 504 MiB under classic CHS, about 8.4 GB with CHS
translation, 137 GB with 28-bit LBA, and with 48-bit LBA the ceiling is
now around 128 PiB, which is enormous relative to the time period it
was introduced in.
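Those limits fall straight out of the addressing schemes; a quick
back-of-the-envelope check (assuming the traditional 512-byte sector):

```python
SECTOR = 512  # bytes; the traditional sector size

# Classic CHS via INT 13h: 1024 cylinders x 16 heads x 63 sectors
chs_limit = 1024 * 16 * 63 * SECTOR
# CHS with head translation: 1024 cylinders x 255 heads x 63 sectors
echs_limit = 1024 * 255 * 63 * SECTOR
# 28-bit LBA (older ATA)
lba28_limit = 2**28 * SECTOR
# 48-bit LBA (ATA-6 and later)
lba48_limit = 2**48 * SECTOR

GB = 1000**3
print(chs_limit / GB)    # ~0.528 GB (the famous "504 MiB" limit)
print(echs_limit / GB)   # ~8.4 GB
print(lba28_limit / GB)  # ~137 GB
print(lba48_limit / GB)  # ~144 million GB, i.e. 128 PiB
```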

Anyway, given a disk that fits within the support of the BIOS, you can
read and write sectors by identifying the C/H/S (very old API) or LBA
address of the first sector and the count of sectors.  The BIOS will of
course return an error condition in the CPU's registers if it
encounters an error while reading one of the sectors.
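The mapping between the two addressing schemes is simple arithmetic;
here is a sketch of the standard conversion for a drive with a given
number of heads and sectors per track (sector numbers are 1-based, as
in the BIOS interface):

```python
def chs_to_lba(c, h, s, heads, sectors_per_track):
    """Standard CHS -> LBA mapping; the sector number s is 1-based."""
    return (c * heads + h) * sectors_per_track + (s - 1)

def lba_to_chs(lba, heads, sectors_per_track):
    """Inverse mapping: LBA -> (cylinder, head, sector)."""
    c, rem = divmod(lba, heads * sectors_per_track)
    h, s0 = divmod(rem, sectors_per_track)
    return c, h, s0 + 1

# Round trip with a classic 16-head, 63-sectors-per-track geometry:
lba = chs_to_lba(100, 5, 17, 16, 63)
print(lba)                          # 101131
print(lba_to_chs(101131, 16, 63))   # (100, 5, 17)
```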

Just as a side note, DOS called upon BIOS to do the work, but
well-behaved programs that were written for DOS and did not require the
ability to go beyond DOS' support of storage would simply call DOS
interrupts to get the job done, which provided a slightly higher-level
API: instead of worrying about raw sectors, you worked with named files
and clusters.

SpinRite doesn't need to use any of the DOS filesystem facilities,
though, and DOS is inherently a single-tasking operating system, so it
is safe for SpinRite to assume that it can have exclusive control of the
disk while it is running.

Also note that there are DOS implementations that have Windows NT-like
restrictions on direct hardware access for certain things.  Such
versions are usually ones that provide some sort of task switching or
multitasking ability, for example taking advantage of the functionality
and features of the 386 or newer CPUs and providing DOS applications
with a V86 environment instead of a real one.

> SpinRite on both the Windows partitions and the Linux partitions. It
> doesn't care. It works strictly at the sector level and is non
> destructive. Even to use the badblocks command as you and Jim have
> suggested on my old PC, I have to shut it down and boot a foreign OS, ie
> a live Linux CD, in order to run the test. That's exactly the same thing
> SpinRite is doing. It just happens to be booting freedos rather than Linux.

No, you can do it while the system is running, you just cannot use the
read-write test mode.

The read-write test mode in badblocks is superior because it does not
depend on the data currently stored in the sector.  Certain data
patterns can hide error conditions which may exist on the platter; for
example, a particular bit may be stuck "on" or "1", but you'd never know
that if the value that is there is legitimately 0xFF.  But you'd detect
it if you wrote 0x00 there, and when you read it back it was, for
example, 0x01 or 0x80, because the stuck bit didn't get cleared.
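A toy model of that stuck-bit scenario (the values are hypothetical,
purely to show why a read-only pass can miss the fault):

```python
STUCK_MASK = 0x01  # model a cell whose lowest bit is stuck at "1"

def write_sector(value):
    """Simulate writing a byte to the faulty cell: whatever is
    written, the stuck bit always reads back as 1."""
    return value | STUCK_MASK

# If the sector legitimately held 0xFF, a read-only test sees no problem:
stored = write_sector(0xFF)
print(stored == 0xFF)   # True -- the fault is invisible

# A read-write test writes a pattern that exercises the bit:
stored = write_sector(0x00)
print(hex(stored))      # 0x1 -- the stuck bit is exposed
```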

SpinRite does this at the disk sector level (presumably, since it is an
ancient program, with a fixed value of 512 for the sector size).  The
badblocks command works on blocks, too, but you can specify the size of
what it considers a block.  A common value is 4,096 bytes for a block
when running badblocks, though virtually any block size that is a
multiple of 512 will work for older drives; make that a multiple of
4,096 for modern so-called "advanced format" drives.

> I may run the non destructive rw test on the old pc using badblocks as
> you and Jim suggested in other messages. It already passed the long
> smart test and it says the drive is healthy with no bad sectors. I just
> have to figure out how much additional time I want to spend on it.

For a non-critical (e.g., personal) system, SMART should be sufficient.
You of course take regular (monthly or weekly) backups of your ${HOME},
right?  If so, recovery is possible within hours of installing a new
drive, and for an individual, particularly one who has multiple
computers, that is acceptable.  If not, you can employ other measures,
such as RAID, to delay or defer the need for restoration, but backups
are backups and still quite necessary.

I have one array I manage that would take approximately 30 hours to
restore from backup.  In an attempt to avoid that in all but the most
devastating situations, I have it on a RAID array.  As long as the
array's health is maintained, I can keep taking and testing backups,
knowing that if need be I can recreate the configuration as it exists
in the office today along with all the data; but if only one or two
drives fail I don't have to, because I can replace them almost
immediately upon failure.  I'm a bit on the paranoid side, too: I start
replacing drives as soon as they stop behaving 100% perfectly, rather
than waiting for a hard failure.

So far that seems to have been a good way to ensure that things stay
running... the drives I've removed were all used again in my home
(after, of course, being wiped) and failed within a month of removal
from the array.  One of them failed only two days after removal.  Even
more than SMART, the Linux kernel ring buffer is a great source to
monitor for disk trouble, especially on a system that has many disks
with lots of activity on them.  The kernel will notice errors as soon as
they are encountered.

Additionally, RAID devices get "scrubbed" once per month anyway (at
least by default on Debian and Ubuntu).  The "scrub" process is
*exactly* what SpinRite does, reading everything on all the disks.  It
doesn't re-write every sector, but it doesn't need to: if a sector
cannot be read it is reassembled from parity information and an attempt
is made to re-write it.  If the drive was able to re-map the sector, the
write will work and things continue.  If the drive was unable to re-map
the sector (say, because it ran out of sectors in the spare sector area)
then the write will most likely fail and the disk will be marked
"failed" by the virtual RAID controller.

That's robust enough for me, at least with the requirements I have in
the environments that I am managing for the moment.  I would like,
however, to have a beefier system for the RAID.  Not because of the CPU,
but because it would really do well to have a veritable buttload of RAM.
A lot of the operations would be faster if the system had 4 GB of RAM
to use for caching and buffers...

	--- Mike

-- 
A man who reasons deliberately, manages it better after studying Logic
than he could before, if he is sincere about it and has common sense.
                                   --- Carveth Read, “Logic”
