VMFS Corruption and Recovery

Mar 3rd, 2010

Over the last few days I’ve been dealing with a big problem. Consider an infrastructure with several ESX4 hosts connected over four paths to a NetApp 3040 MetroCluster. For reasons that are still not really known, our storage reported the following error:


Wed Feb 10 10:58:12 CET [nas1b: scsitarget.fcp.dump.warning:warning]: FCP target SRAM dump disabled for adapter 0b, isp2400fct_process_ctio: Invalid CTIO Status: S_ID: 10600, OX_ID: 4F, Status 0x8 (INVALID_RX), Cmdblk 0x0000000012620a00, state
Wed Feb 10 10:58:12 CET [nas1b: scsitarget.ispfct.errorReset:CRITICAL]: Error processing target scsi command on Fibre Channel target adapter 0b. Resetting the adapter to clear INVALID_RX condition.
Wed Feb 10 10:58:12 CET [nas1b: scsitarget.ispfct.reset:notice]: Resetting Fibre Channel target adapter 0b.

This happened on all four paths: a complete reset. Obviously VMware didn’t understand the situation and, from its side, reported lost connectivity to the datastore. What was the problem? VMware lost some open files. Worse, some files and directories were corrupted, and it was not possible to remove them because of a filesystem inconsistency.
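
To see how often each adapter was reset, the reset notices can be counted per filer and adapter. This is only a minimal sketch: it assumes the filer messages are available as plain text in the format quoted above, and the function name `count_adapter_resets` is made up for this example.

```python
import re
from collections import Counter

# Matches the reset notice quoted in the post, for example:
# "[nas1b: scsitarget.ispfct.reset:notice]: Resetting Fibre Channel target adapter 0b."
RESET_RE = re.compile(
    r"\[(?P<filer>[^:\]]+): scsitarget\.ispfct\.reset:notice\]: "
    r"Resetting Fibre Channel target adapter (?P<adapter>\w+)\."
)


def count_adapter_resets(log_text):
    """Count reset notices, keyed by (filer, adapter)."""
    return Counter(
        (m.group("filer"), m.group("adapter"))
        for m in RESET_RE.finditer(log_text)
    )


# Sample text copied from the messages above.
sample = (
    "Wed Feb 10 10:58:12 CET [nas1b: scsitarget.ispfct.errorReset:CRITICAL]: "
    "Error processing target scsi command on Fibre Channel target adapter 0b. "
    "Resetting the adapter to clear INVALID_RX condition.\n"
    "Wed Feb 10 10:58:12 CET [nas1b: scsitarget.ispfct.reset:notice]: "
    "Resetting Fibre Channel target adapter 0b.\n"
)

for (filer, adapter), count in count_adapter_resets(sample).items():
    print(f"{filer} adapter {adapter}: {count} reset(s)")
```

Only the `reset:notice` line is counted, so the `errorReset:CRITICAL` line that announces the same reset is not counted twice.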

We called VMware, who confirmed that there is no way to solve the issue: we had to remove the datastore and create a new one. VMFS does not provide, publicly, a way to run a consistency check. Also, VMFS and VMware technology do not always commit data, so when connectivity is lost many files are left open though not corrupted.

Looking around, there is a feature, available through the -R argument, that should check a VMFS. But reading further, it only unlocks files left locked after an ESX server crash. It does not repair consistency problems. Since VMFS is a proprietary filesystem, I have not been able to find a document describing it (and maybe none exists) that would explain how and why it can end up corrupted.

  1. Ben Hale
    Mar 25th, 2010 at 15:15

    We have the same issue with our ESX4 and NetApp 3020. We are working with NetApp to figure out what the cause of the issue is. Have you seen a resolution?

  2. Stefano
    Mar 25th, 2010 at 16:18

    Check your HBA firmware version. Is there any update? Are these QLogic or Emulex?

  3. Natalie
    Mar 26th, 2010 at 19:48

    We’ve had the same problem for the last two weekends. VMware hasn’t responded at all to my service request. Netapp began requesting information today after I contacted my local Netapp rep and he escalated it to ‘hot case’. Netapp is having me run nSANity to collect logs from the host, the fabric and the storage system and to run perfstat from 1-2 a.m. Sunday morning since our HBA reset has been happening at 1:20 a.m. on Sunday.

    I need a resolution badly. Management is not happy.

  4. Stefano
    Mar 26th, 2010 at 19:50

    What kind of problem? Can you be more specific?

  5. Natalie
    Mar 27th, 2010 at 03:26

    The FCP sram dump error that causes VMFS datastore corruption. I’ve spent the last two Sundays rebuilding virtual machines – some from clones, some from backup and unfortunately a couple of them from scratch. I’m on Netapp FAS3140 and vSphere 4 update 1a.

  6. Lee
    Apr 30th, 2010 at 12:59

    We are getting the same corruption on ESX4 and Netapp Ontap 7.2.4. We’ve been told that this configuration is not supported and that the issue is resolved in 7.3, but something must be seriously wrong to cause this. Did anyone find an interim fix or the cause? We are not planning to upgrade Ontap for a few weeks and it’s on a knife edge.

    Thanks

    L

  7. Stefano
    Apr 30th, 2010 at 13:09

    I guess no interim fix is available. In my case Netapp didn’t understand the problem. To them it seemed not to be a real problem. *FUCKING* storage

  8. Natalie
    May 18th, 2010 at 05:28

    Netapp had me run perfstat and nSANity for them to review. They found that the FCP SRAM dump was occurring because of high network utilization during our backup window, so we’ve been able to resolve that. They also found that I needed to run mbralign, from their FCP utilities santools, on all my vmdks. That took weeks of after-hours time to get done, but I haven’t had any vmdk corruption since I finished doing that.
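
The alignment problem that mbralign addresses in the last comment can be illustrated with a small check. This is not mbralign itself, only a sketch of the idea: it assumes 512-byte sectors and a 4 KB storage block size, reads the four primary entries of a classic MBR partition table, and reports whether each partition starts on a block boundary. The helper names are invented for the example.

```python
import struct

SECTOR_SIZE = 512   # assumption: classic 512-byte sectors
BLOCK_SIZE = 4096   # assumption: 4 KB blocks on the storage side


def mbr_partition_offsets(mbr):
    """Return the starting byte offset of each non-empty primary partition."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    offsets = []
    for i in range(4):
        entry = mbr[446 + 16 * i : 446 + 16 * (i + 1)]
        part_type = entry[4]                               # 0 means unused slot
        start_lba = struct.unpack_from("<I", entry, 8)[0]  # little-endian LBA
        if part_type != 0:
            offsets.append(start_lba * SECTOR_SIZE)
    return offsets


def is_aligned(offset, block_size=BLOCK_SIZE):
    """True when the partition starts exactly on a storage block boundary."""
    return offset % block_size == 0


def fake_mbr(start_lba):
    """Build an in-memory MBR with one partition, for demonstration only."""
    mbr = bytearray(512)
    mbr[446 + 4] = 0x07                          # arbitrary non-zero type
    struct.pack_into("<I", mbr, 446 + 8, start_lba)
    mbr[510:512] = b"\x55\xaa"                   # boot signature
    return bytes(mbr)


for lba in (63, 64):
    offset = mbr_partition_offsets(fake_mbr(lba))[0]
    print(f"start sector {lba}: offset {offset}, aligned: {is_aligned(offset)}")
```

A partition starting at sector 63, the traditional default of older partitioning tools, begins at byte 32256, which is not a multiple of 4096, so every guest block straddles two storage blocks. Starting at sector 64 puts it on a boundary.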