[ale] Samba: file corruption on write to share followed by hang

Jim Kinney jim.kinney at gmail.com
Tue Dec 15 13:13:21 EST 2009


I agree. There is no publicly available mechanism for an ECC memory error
event to make it to the kernel. When those events occur, there is (usually)
a way for that information to be stored in NVRAM on the affected DIMM
itself. Accessing that area requires specialty kernel code that varies by
RAM maker. While I _have_ seen that code in operation, it is not publicly
available: unless you work for a large search engine that custom-builds its
own hardware and runs Linux for everything, you're out of luck getting
access to that data.
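
On the /var/log/kernel question quoted below: about the best you can do from
the log side is look for the NMI message Mike mentions. Here is a rough
sketch of that in Python. The pattern list is my assumption (the NMI wording
follows the message quoted in this thread, and "machine check" and "parity"
are guesses at other suspicious strings), the names find_suspect_lines and
scan_log are made up for illustration, and the exact log text varies by
kernel version. A hit is only a hint, since a lot of things raise NMIs, and
an empty result proves nothing about the RAM.

```python
import re

# Assumed patterns: wording varies between kernel versions, so adjust to taste.
SUSPECT_PATTERNS = [
    re.compile(r"\bNMI\b"),
    re.compile(r"dazed and confused", re.IGNORECASE),
    re.compile(r"machine check", re.IGNORECASE),
    re.compile(r"\bparity\b", re.IGNORECASE),
]


def find_suspect_lines(lines):
    """Return (line_number, text) for every line matching a suspect pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in SUSPECT_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path="/var/log/kernel"):
    """Scan a log file; the default path is the one mentioned in the thread."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)
```

Calling scan_log() with no argument reads the path from Jeff's question;
pass whatever file your syslog setup actually writes kernel messages to.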

That said, there may be some BIOS-level processes that can analyze memory
faults. I know I've seen that on some older Compaq and newer IBM hardware.
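
To illustrate my earlier point (quoted below) that ECC fixes one flipped bit
but not two, here is a toy single-error-correct, double-error-detect code in
Python. It is only a sketch: it protects 4 data bits with a Hamming(7,4) code
plus one overall parity bit, whereas ECC DIMMs typically protect 64 data bits
with 8 check bits, and the encode/decode names are made up for illustration.
The principle is the same, though. One flip is located and repaired; two
flips are noticed but the original data cannot be recovered.

```python
def encode(nibble):
    """Encode 4 data bits as 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    word[3], word[5], word[6], word[7] = data
    # Each check bit covers the positions whose index has that bit set.
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2  # makes the total number of 1 bits even
    return word


def decode(word):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    word = list(word)
    syndrome = 0
    for position in range(1, 8):
        if word[position]:
            syndrome ^= position
    parity_broken = sum(word) % 2 == 1
    if parity_broken:
        # Odd number of flips: treat it as one flip at index `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Parity looks fine but the syndrome does not: two bits flipped.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble
```

Flip any one entry of encode(n) and decode() still hands back n with status
"corrected"; flip any two and it reports "uncorrectable".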

On Tue, Dec 15, 2009 at 10:01 AM, Michael H. Warfield <mhw at wittsend.com> wrote:

> On Tue, 2009-12-15 at 00:00 -0500, Jeff Hubbs wrote:
> > OK, but being ECC RAM, wouldn't something have shown up in
> > /var/log/kernel?  How could I tell other than using FSM-style faith?
>
>         I don't believe there's a specific interrupt or error upon memory
> parity or ECC failure.  I think it generates an NMI (Non Maskable
> Interrupt) but a lot of things could generate that error (Error:
> Unexpected NMI. Dazed and confused but trying to continue anyways).  I
> don't know if there's an indication in a memory controller somewhere or
> not about that.  Might depend on your hardware.  Obviously, once you
> take a non-recoverable memory hit, everything becomes suspect.
>
> > Jim Kinney wrote:
> > > Bad ECC RAM is still bad RAM. ECC can only correct a single bit flip
> > > in register. 2 bit flips and it's all toast.
> > >
> > > It does sound like Samba managed to totally corrupt itself and the
> > > hang later may have been related to the system thrashing RAM around.
> > > The filesystem definitions are kernel space so Samba has to access
> > > that to function. That just restarting Samba helped is a pretty good
> > > indication that it was memory associated with the Samba process. The
> > > aggressive caching of the kernel will amplify a bad memory situation.
> > > Restarting Samba will cause the Samba caching to also restart and that
> > > may have overwritten the bad data portion which was related to the
> > > filesystem management area.
>
>         Mike
>
> --
> Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
>   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
>   NIC whois: MHW9          | An optimist believes we live in the best of all
>  PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!


-- 
James P. Kinney III
Actively in pursuit of Life, Liberty and Happiness