A few days ago I started having failures on this secondary HD, where it would switch to read-only. It became frequent, so I replaced the HD.
A few days later (this week) the same problem started again.
Every day at 4 a.m. the HD goes read-only. I noticed the load climbs to 30.0. I already checked cron and found nothing suspicious that could cause this.
Could it be a controller failure? The server has a RAID controller, but I don't use the disks in RAID.
Here is today's log:
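For reference, this is roughly how the crontab check for the 4 a.m. hour can be scripted. It is only a sketch: the `SAMPLE` text below is a hypothetical system crontab for illustration, not the one from this server, and the helper names `hour_matches` and `jobs_at_hour` are made up for this example.

```python
def hour_matches(field, hour):
    """Return True if a crontab hour field (e.g. '4', '*', '0-6', '*/2') covers `hour`."""
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            lo, hi = 0, 23
        elif "-" in rng:
            lo, hi = (int(x) for x in rng.split("-"))
        else:
            lo = hi = int(rng)
        if lo <= hour <= hi and (hour - lo) % step == 0:
            return True
    return False


def jobs_at_hour(crontab_text, hour):
    """List crontab entries whose hour field includes `hour`."""
    hits = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 6:
            continue  # skip variable assignments such as SHELL=/bin/bash
        try:
            if hour_matches(fields[1], hour):
                hits.append(line)
        except ValueError:
            continue  # skip entries this sketch cannot parse (e.g. @daily)
    return hits


# Hypothetical crontab contents, for illustration only.
SAMPLE = """\
SHELL=/bin/bash
# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 5 1 * * root run-parts /etc/cron.monthly
"""

for job in jobs_at_hour(SAMPLE, 4):
    print(job)
```

Note that an entry with `*` in the hour field also runs during the 4 a.m. hour, so it is listed too.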
Nov 4 04:17:28 XXXX kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Nov 4 04:17:30 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:17:46 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:18:02 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:18:15 XXXX kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Nov 4 04:18:18 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:18:19 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:18:23 XXXX kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Nov 4 04:18:31 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:18:48 XXXX kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Nov 4 04:18:52 XXXX kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Nov 4 04:18:54 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:18:59 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:19:39 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:20:40 XXXX kernel: INFO: task kjournald:5263 blocked for more than 120 seconds.
Nov 4 04:20:40 XXXX kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 4 04:20:40 XXXX kernel: kjournald D 00004E8F 2872 5263 11 10967 1678 (L-TLB)
Nov 4 04:20:40 XXXX kernel: f26aef40 00000046 984fe6bd 00004e8f 00000000 00000000 00000000 0000000a
Nov 4 04:20:40 XXXX kernel: f2681aa0 984fec34 00004e8f 00000577 00000001 f2681bac c2013ac4 f7af83c0
Nov 4 04:20:40 XXXX kernel: 00000400 00000000 f7d55700 c20109c4 f7af83c0 f7d55550 f26aef70 ffffffff
Nov 4 04:20:40 XXXX kernel: Call Trace:
Nov 4 04:20:40 XXXX kernel: [<f887609b>] journal_commit_transaction+0x137/0xeec [jbd]
Nov 4 04:20:40 XXXX kernel: [<c043607b>] autoremove_wake_function+0x0/0x2d
Nov 4 04:20:40 XXXX kernel: [<c042d6f4>] try_to_del_timer_sync+0x65/0x6c
Nov 4 04:20:40 XXXX kernel: [<f8879c11>] kjournald+0xa1/0x1c2 [jbd]
Nov 4 04:20:40 XXXX kernel: [<c043607b>] autoremove_wake_function+0x0/0x2d
Nov 4 04:20:40 XXXX kernel: [<f8879b70>] kjournald+0x0/0x1c2 [jbd]
Nov 4 04:20:40 XXXX kernel: [<c0435fb7>] kthread+0xc0/0xed
Nov 4 04:20:40 XXXX kernel: [<c0435ef7>] kthread+0x0/0xed
Nov 4 04:20:40 XXXX kernel: [<c0405c53>] kernel_thread_helper+0x7/0x10
Nov 4 04:20:40 XXXX kernel: =======================
Nov 4 04:20:42 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:22:44 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:22:45 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:22:54 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:23:20 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:24:07 XXXX kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
Nov 4 04:24:42 XXXX kernel: sd 0:0:1:0: SCSI error: return code = 0x08000002
Nov 4 04:24:42 XXXX kernel: sdb: Current: sense key: Medium Error
Nov 4 04:24:42 XXXX kernel: Add. Sense: Unrecovered read error
Nov 4 04:24:42 XXXX kernel:
Nov 4 04:24:42 XXXX kernel: Info fld=0xc1850f3
Nov 4 04:24:42 XXXX kernel: end_request: I/O error, dev sdb, sector 202920179
Nov 4 04:24:42 XXXX kernel: EXT3-fs error (device sdb1): ext3_free_branches: Read failure, inode=12681431, block=25365014
Nov 4 04:24:42 XXXX kernel: Aborting journal on device sdb1.
Nov 4 04:24:42 XXXX kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted
Nov 4 04:24:42 XXXX kernel: EXT3-fs error (device sdb1) in ext3_truncate: Journal has aborted
Nov 4 04:24:42 XXXX kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted
Nov 4 04:24:42 XXXX kernel: EXT3-fs error (device sdb1) in ext3_orphan_del: Journal has aborted
Nov 4 04:24:42 XXXX kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted
Nov 4 04:24:42 XXXX kernel: EXT3-fs error (device sdb1) in ext3_delete_inode: Journal has aborted
Nov 4 04:24:42 XXXX kernel: ext3_abort called.
Nov 4 04:24:42 XXXX kernel: ext3_abort called.
Nov 4 04:24:43 XXXX kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Nov 4 04:24:43 XXXX kernel: Remounting filesystem read-only
Nov 4 04:24:43 XXXX kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Nov 4 04:24:43 XXXX kernel: __journal_remove_journal_head: freeing b_committed_data
Nov 4 04:24:43 XXXX kernel: journal commit I/O error
Nov 4 04:29:52 XXXX kernel: __journal_remove_journal_head: freeing b_committed_data
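A quick sanity check on the numbers in the log: the failing disk sector and the ext3 block in the error seem to point at the same spot. The sketch below assumes a 4096-byte filesystem block, 512-byte sectors, and that sdb1 starts at sector 63 (the usual legacy layout). None of these three values appear in the log, so they should be confirmed on the machine, for example with `fdisk -lu /dev/sdb`.

```python
SECTOR_SIZE = 512       # assumption: 512-byte sectors
BLOCK_SIZE = 4096       # assumption: ext3 block size of 4 KiB
PARTITION_START = 63    # assumption: sdb1 starts at sector 63


def sector_to_fs_block(sector, part_start=PARTITION_START,
                       block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Map an absolute disk sector to a block number inside the partition's filesystem."""
    sectors_per_block = block_size // sector_size
    return (sector - part_start) // sectors_per_block


# Values taken from the log above.
failing_sector = 202920179          # end_request: I/O error, dev sdb, sector ...
info_field = int("c1850f3", 16)     # Info fld=0xc1850f3
reported_block = 25365014           # ext3_free_branches: ... block=25365014

print(info_field == failing_sector)                          # True
print(sector_to_fs_block(failing_sector))                    # 25365014
print(sector_to_fs_block(failing_sector) == reported_block)  # True
```

Under those assumptions, the "Medium Error" sector and the block ext3 failed to read are the same location on sdb.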
Could it really be just a timeout? I already followed a tutorial on the Locaweb wiki, http://wiki.locaweb.com.br/pt-br/ERRO_-_Re...nly_file_system, but it did not solve the problem.
So today, after a reboot and without mounting the HD, I ran fsck and it fixed errors (I did the same yesterday), and today I also did what the message above asks:
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Will that fix it? I'll check around 4 a.m., but what could be causing this? This server is already a year old and this only started a few days ago.
This post has been edited by Insert: 05 November 2010 - 12:36
