[ale] Kernel panic

Mon Feb 13 12:28:26 EST 2006

My thoughts exactly on the controller.  This system has 2 drives in
RAID1 via mdtools, so what I was seeing was not a single disk dying.  It
looked as if the whole disk subsystem was disappearing.  The last time
this happened was in December, and it caused ext2 corruption that
propagated across the mirror.  The only fix was a full reinstall.  Since
I had no access to /var/log/messages because of the corruption, I had no
clue what had happened.  This time the box simply went south, and after a
reboot I was able to read the syslog file and pull that oops from it.  I
also saw many other programs reporting to syslog that they could no
longer open files.  That is how I knew the system was still up.
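
For anyone who wants to do the same, here is a rough Python sketch of
pulling an oops block out of a syslog file and decoding the "Oops: 0002"
error code.  The function names (extract_oops, find_oops_code,
decode_fault_code) and the syslog line pattern are assumptions made up
for illustration, based on the log format quoted below.  The bit meanings
are those of the i386 page-fault error code: bit 0 is protection fault
vs. page not present, bit 1 is write vs. read, bit 2 is user vs. kernel
mode.

```python
import re

# Syslog prefix "Mon DD HH:MM:SS host kernel: " (pattern assumed from the
# quoted log; other syslog configurations may format lines differently).
KERNEL_LINE = re.compile(r"^\w{3}\s+\d+\s[\d:]{8}\s\S+\skernel:\s?(.*)$")
OOPS_CODE = re.compile(r"^Oops:\s+([0-9a-fA-F]+)")


def extract_oops(lines):
    """Collect kernel message bodies from the first 'Unable to handle kernel'
    line through the 'Code:' line that ends an oops report."""
    block = []
    capturing = False
    for line in lines:
        match = KERNEL_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # not a kernel message; other daemons may interleave
        body = match.group(1).strip()
        if not capturing and body.startswith("Unable to handle kernel"):
            capturing = True
        if capturing:
            block.append(body)
            if body.startswith("Code:"):
                break
    return block


def find_oops_code(block):
    """Return the hexadecimal error code from the 'Oops:' line, or None."""
    for body in block:
        match = OOPS_CODE.match(body)
        if match:
            return int(match.group(1), 16)
    return None


def decode_fault_code(code):
    """Decode the low three bits of an i386 page-fault error code."""
    return {
        "protection_fault": bool(code & 1),  # False means page not present
        "write": bool(code & 2),             # False means read
        "user_mode": bool(code & 4),         # False means kernel mode
    }
```

Applied to the oops quoted below, code 0002 decodes to a write, from
kernel mode, to a page that was not present.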

This oops occurred at 02-09 12:32, and I had messages in syslog up to
02-10 15:00.  That is when the messages stopped.  So even after this
oops, syslogd was apparently still able to write to disk; it just seemed
like every program that attempted to open() a file failed.  That in
itself is interesting, because I would expect that if the controller had
failed, no write() calls would have succeeded either.
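
One possible explanation (an assumption on my part, not something
confirmed on this box) is that write() on a descriptor that is already
open does not need to look the path up again, while every open() goes
through the path lookup code that shows up in the call trace quoted
below.  A minimal Python sketch of a probe that separates the two cases;
probe() and its arguments are hypothetical names, not an existing tool:

```python
import errno
import os


def probe(path_to_open, held_fd):
    """Try a fresh open() on a path and a write() on a descriptor that was
    opened earlier, and report the outcome of each separately."""
    result = {}

    # A new open() has to walk the path, so it exercises the lookup code.
    try:
        fd = os.open(path_to_open, os.O_RDONLY)
        os.close(fd)
        result["open"] = "ok"
    except OSError as exc:
        result["open"] = errno.errorcode.get(exc.errno, str(exc.errno))

    # A write() on an existing descriptor needs no path lookup at all.
    try:
        os.write(held_fd, b"probe\n")
        result["write"] = "ok"
    except OSError as exc:
        result["write"] = errno.errorcode.get(exc.errno, str(exc.errno))

    return result
```

A long-running process that opened its log file before the oops would
correspond to the held_fd case, which would be consistent with syslogd
still logging while new open() calls returned "No such file or
directory".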

On Mon, 2006-02-13 at 12:15 -0500, Jeff Hubbs wrote:
> That's not a kernel panic - kernel panics, as a rule, say "kernel panic!"
> 
> The first line is a failed attempt to page, so, yeah, that smells like 
> disk or disk controller death. 
> 
> And, no, you might *not* get earlier messages in the syslog instead of a
> failure like this.
> 
> I saw much the same thing a few months back when a Dell PERC controller 
> died while working.
> 
> Jeff
> 
> Christopher Fowler wrote:
> 
> >The system keeps running, but every attempt by anything to access any
> >file fails.  All open()'s return with "No such file or directory".  But
> >the system does keep running.  It's useless but keeps going.
> >
> >On Mon, 2006-02-13 at 11:49 -0500, James P. Kinney III wrote:
> >  
> >
> >>run a memtest on your ram.
> >>
> >>This is a kernel panic. It halts the system. Once this happens there is
> >>no way to access anything without a reboot.
> >>
> >>On Mon, 2006-02-13 at 10:28 -0500, Christopher Fowler wrote:
> >>    
> >>
> >>>I get this error message:
> >>>
> >>>
> >>>------------------------------------------------------------------------------------
> >>>Feb  9 12:32:40 sam-accunet kernel: Unable to handle kernel paging request at virtual address 625ae8e0
> >>>Feb  9 12:32:40 sam-accunet kernel:  printing eip:
> >>>Feb  9 12:32:40 sam-accunet kernel: 02164c7f
> >>>Feb  9 12:32:40 sam-accunet kernel: *pde = 00000000
> >>>Feb  9 12:32:40 sam-accunet kernel: Oops: 0002 [#1]
> >>>Feb  9 12:32:40 sam-accunet kernel: CPU:    0
> >>>Feb  9 12:32:40 sam-accunet kernel: EIP:    0060:[<02164c7f>]    Not tainted
> >>>Feb  9 12:32:40 sam-accunet kernel: EFLAGS: 00010202   (2.6.5-1.358-SAM-ACCUNET-001)
> >>>Feb  9 12:32:40 sam-accunet kernel: EIP is at proc_read_inode+0x4/0x29
> >>>Feb  9 12:32:40 sam-accunet kernel: eax: 16f68390   ebx: 16f68390   ecx: 00000000   edx: 022cb760
> >>>Feb  9 12:32:40 sam-accunet kernel: esi: 16f68390   edi: 21fb8200   ebp: 0cec6180   esp: 1e33de80
> >>>Feb  9 12:32:40 sam-accunet kernel: ds: 007b   es: 007b   ss: 0068
> >>>Feb  9 12:32:40 sam-accunet kernel: Process sendmail (pid: 1582, threadinfo=1e33d000 task=213c47b0)
> >>>Feb  9 12:32:40 sam-accunet kernel: Stack: 00000000 21ff8910 02164e22 21ff8910 0cec6207 21ff8963 02166f6d ffffffea
> >>>Feb  9 12:32:40 sam-accunet kernel:        00000000 21fb0e10 21fb0e10 1e33df78 0cec6180 21fb3b80 02164f5e 022cb840
> >>>Feb  9 12:32:40 sam-accunet kernel:        0cec6180 21fb0e10 0214ad81 1e33df78 1e33df14 00000000 1e33df78 1e33df14
> >>>Feb  9 12:32:40 sam-accunet kernel: Call Trace:
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<02164e22>] proc_get_inode+0x59/0xdd
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<02166f6d>] proc_lookup+0xb6/0xc4
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<02164f5e>] proc_root_lookup+0x2a/0x42
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<0214ad81>] real_lookup+0x66/0xc8
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<0214af4f>] do_lookup+0x43/0x72
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<0214b494>] link_path_walk+0x516/0x6e2
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<02135342>] follow_page+0xda/0xe5
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<0214b8b3>] path_lookup+0xf8/0x128
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<0214bee9>] open_namei+0x93/0x3eb
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<02135342>] follow_page+0xda/0xe5
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<0214098f>] filp_open+0x23/0x3c
> >>>Feb  9 12:32:40 sam-accunet kernel:  [<02140cde>] sys_open+0x31/0x7d
> >>>Feb  9 12:32:40 sam-accunet kernel:
> >>>Feb  9 12:32:40 sam-accunet kernel: Code: 51 89 e0 e8 5a 62 fb ff 8b 54 24 04 8b 04 24 89 53 5
> >>>------------------------------------------------------------------------------------
> >>>
> >>>After I get this, no program can access the file system.  It's as if
> >>>the disks have disappeared.  If it was a controller failure or some
> >>>other hardware failure, would I not get earlier messages in the syslog
> >>>instead of a failure like this?
> >>>
> >>>_______________________________________________
> >>>Ale mailing list
> >>>Ale at ale.org
> >>>http://www.ale.org/mailman/listinfo/ale