[ale] Very bad kernel OOPSes.

Sat Jul 26 15:43:01 EDT 1997

Hello, ALErs.

I've got a 486 that I've been running Linux on without any trouble for
about four years now. However, a few weeks ago I made some hardware
changes, and things have begun to act screwy. Perhaps someone has
had a similar experience, or knows more about hardware issues than
I; any advice is appreciated.

The symptom is that after the machine has been up and running for
about two days, processes start to die with very bad-looking OOPS
messages, among them:

"Can't handle kernel paging request at <address>"
"General protection 0000"
"Can't handle kernel null pointer dereference at <address>"

When the problem manifests itself, the OOPS appears as soon as I try to
start some particular executable. Sometimes it's Maelstrom, sometimes
it's ping, sometimes it's popclient, sometimes it's more than one of
the above. Any other executable I try works OK; it's always a
particular program or set of programs that triggers the problem, just
not always the same one(s). But everything works fine for at least a
day or two before this starts to occur. Also, when this problem
occurs, memory usage is not necessarily particularly high. The last
time it happened, I couldn't run ping or popclient or lynx, but I
could start xemacs without any trouble, and top claimed that only
10M out of the 24M of physical RAM was in use; swap usage was
something like 15M out of 64M.

I can think of two things that may have precipitated this, and I'd
like y'all's opinion as to which is the more fruitful place to search.

(1) I installed 16M of RAM, in two 8M 60ns SIMMs. I removed a pair of
4M 60ns SIMMs, but left in 8M worth of 1M 80ns SIMMs. (Before you say
"Duhh", I needed the 60ns chips for a Cyrix 686 machine I just
acquired, which doesn't like the 80ns chips.) All the RAM passes its
power-on test OK.
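
The power-on test is only a quick pass, so a longer pattern test
might say more about the mixed 60ns/80ns setup. Here is a minimal
sketch of the idea, not something I have run on this box: the
function name, the patterns, and the buffer size are all made up for
illustration. A user-space program only touches whatever pages the
kernel hands it, so a clean run would not prove the RAM is good.

```python
# Hypothetical user-space pattern test; names and sizes are illustrative only.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def pattern_test(size_bytes, patterns=PATTERNS):
    """Fill a buffer with each pattern and return a list of
    (offset, expected, found) tuples for every byte that reads back wrong."""
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        # Write the pattern to every byte of the buffer.
        buf[:] = bytes([pattern]) * size_bytes
        # Read every byte back and record any that differ.
        for offset, value in enumerate(buf):
            if value != pattern:
                errors.append((offset, pattern, value))
    return errors


if __name__ == "__main__":
    bad = pattern_test(1024 * 1024)  # 1M buffer; the size is arbitrary
    print("mismatches:", len(bad))
```

Running it in a loop over a much larger buffer would be the obvious
next step, but any mismatch at all would point at the RAM rather than
at the SCSI chain.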

(2) I installed an old Seagate 42100 2G SCSI HD, which I got for
free. It seems to work fine, although it is the size of a small
refrigerator. I'm not sure I have it terminated correctly, because I
have inadequate documentation, and Seagate doesn't support this drive
anymore on its web site. There are four jumper pins on the drive that
look relevant, labelled as follows:

1 3
2 4

3&4 = "Term pwr to bus"
1&2 = "Term pwr from drv"
2&4 = "Term pwr from bus"

At the moment the "Term pwr from drv" jumper is set. Any idea how
these should be set? I don't know enough about SCSI to know if
termination problems could ever cause this kind of bad behavior. BTW I
can also talk to my T4000 SCSI tape fine; it was previously the only
device on the chain, and I removed its termination resistors before
adding the Seagate to the end of the chain. I'm using a cheapo
53C810 SCSI card, which I've heard is supposed to be very tolerant
of termination problems (which may mean, "It'll only fail every
couple of days.")

Finally, I'm running an unpatched 2.0.27 kernel.

Once again, TIA for any advice.

- Joe Knapka (aka "Clueless Joe")