Sinking feeling: random data corruption

Discussion in 'Windows Server' started by Myrrh, Feb 16, 2011.

  1. Myrrh

    Myrrh MDL Expert

    Nov 26, 2008
    1,511
    627
    60
    #1 Myrrh, Feb 16, 2011
    Last edited by a moderator: Apr 20, 2017
    Guys,

    I'm hoping someone has heard of this before and knows what's going on. I have two systems with the following characteristics:
    Code:
    System 1 "production box":
    Gigabyte GA-X58-USB3
    Core i7-950
    24GB of G.SKILL Ripjaws DDR3-1333
    Adaptec 1430 SATA HostRAID adapter with WD2500YD (Raid) and WD1001FALS (black, WDTLER-ified) drives, two of each, in mirrors
    
    System 2 "test box":
    Lenovo 7522K6U (Pentium Dual Core E6300)
    4GB DDR2-800
    Seagate ST3250318AS
    
    On both systems:
    Windows Server 2008R2 SP1 Datacenter
    Hyper-V role installed
    File Server role installed
    The problem appears to be random data corruption when dealing with large files and seems to be related to cache of some sort. Let me cite a 100% reproducible example from the production box.

    • I build a new virtual machine and install it from an ISO file which I have verified has the correct MD5. My ISO files and the VHD are stored on the WD1001FALS drives. For the sake of discussion let's call it a "WZOR 7601 Windows 7 enterprise x64" install media.
    • The machine installs and boots correctly.
    • A few moments later, I build a second new virtual machine, install it from the same ISO file.
    • The install fails citing a corrupt install disk.
    • I check the MD5 and it no longer matches. I get that sick feeling of panic. I notice that when I run the hashchecker tool, it reads the file much faster than normal.
    • If I leave it alone and wait an hour or two and check again, the MD5 of the ISO file has magically fixed itself.
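
The check in the steps above can be repeated with a small script that logs both the digest and how long each pass takes. This is only a minimal sketch, not the hashchecker tool mentioned in the post. The function names and the example path are made up for illustration. An unusually fast pass suggests the data was served from a cache rather than read from the disk.

```python
import hashlib
import time


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_hashes(path, runs=3, delay_seconds=0.0):
    """Hash the same file several times; return (digest, seconds) per run."""
    results = []
    for _ in range(runs):
        started = time.perf_counter()
        digest = md5_of_file(path)
        results.append((digest, time.perf_counter() - started))
        if delay_seconds:
            time.sleep(delay_seconds)
    return results


def all_match(results, expected):
    """True if every recorded digest equals the expected hex digest."""
    return all(digest == expected.lower() for digest, _ in results)
```

For example, `repeated_hashes(r"D:\iso\install.iso", runs=5, delay_seconds=600)` (a hypothetical path) would take five readings ten minutes apart, which makes it easier to see whether a mismatch lines up with a suspiciously short read time.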

    I have performed the same test on the "test box" and got the same results, which makes me think it is not necessarily hardware related. So far, I don't appear to have lost or corrupted anything permanently, but this has got me really worried.

    Additional observation: If I put each box through a normal shutdown, all the Hyper-V machines save state, and come back online when the box is turned back on. However, any machine using a very large amount of memory will very likely fail to restore citing a "memory corruption" error. This has only happened on the "production box" and I theorize this is because the "test box" doesn't have a large enough memory to trigger a corrupt save state file.
     
  2. 2centsworth

    2centsworth MDL Senior Member

    Feb 12, 2008
    333
    24
    10
  3. Myrrh

    Myrrh MDL Expert

    Nov 26, 2008
    1,511
    627
    60
    Those are some good suggestions, I will give all of that a try.

    Before bringing the machine online I let it run a memory test for a couple of days which found nothing though.

    My experience with Qualified Vendor lists has been that by the time I have my hands on the board, the list is pretty old and I can't find anything that is actually on the list, but if I buy something with identical specs it works fine. I have not specifically verified the memory speed, just went under the supposition that the chipset knows how fast it can run a given configuration and would automatically do that. There are a few BIOS options I probably need to go look at.
    The only common thing I can put my finger on is the host OS version, both were Win2008R2 RTM and were upgraded to SP1, which is why I'm wondering if I'm being bitten by a Windows bug of some sort.
     
  4. Myrrh

    Myrrh MDL Expert

    Nov 26, 2008
    1,511
    627
    60
    Yeah, my first thought was memory and overheating, I do in fact have 6 of the 4GB modules. I bought one of those Kingston Hyper-X things with the two fans on it blowing directly on the memory to reduce the heat. Heat gun said the temperatures on and between the modules dropped around 10 degrees Celsius.

    It's a server and I care about data integrity (hence everything being on RAID 1 mirrors) so I've not done any overclocking at all.

    The memory test was the one you can access from booting the Windows install disc. I'll give memtest86+ a try.
     
  5. 2centsworth

    2centsworth MDL Senior Member

    Feb 12, 2008
    333
    24
    10
    #6 2centsworth, Feb 18, 2011
    Last edited: Feb 18, 2011
    It's unlikely the RAM is overheating, especially if it's G.SKILL Ripjaws (oversized heat sinks) 1333 running at 1066 and you have any decent airflow in the case. If it were overheating, it would be too hot to grip with your fingers.

    Use CPU-Z or similar to check what speed the DDR3 is running at; it should be 1066.

    Qualifying DDR3 is different than in the old days of DDR2. While your chips may be similar, unless they're tested and on the QVL it's a crapshoot at best.

    Only two modules are listed for 4GB: Samsung under the 1066 list and Hynix under 1333. You might email Gigabyte and ask what RAM runs at a density of 24GB (6x4GB). They only list two chips for that density.

    Strange that one time you read off a drive the checksum matches, the next time it does not, yet later it does again (makes no sense). If you ran a RAM test that tests all the RAM and it passed, that just adds to the confusion and brings me back to the drive and its integrity, but if there are no reallocated sectors or other SMART errors... well, you have a real mystery. Can you roll back to pre-SP1 and see whether your theory that SP1 is to blame holds up?
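
One way to narrow down a checksum that comes and goes is to hash the file block by block on a good pass and on a bad pass, then compare the two lists. This is a rough sketch under assumptions: the 4 MB block size and the function names are arbitrary choices for illustration, not something from this thread. If the differing blocks move around between passes, that points away from specific bad sectors on the drive.

```python
import hashlib


def block_digests(path, block_size=4 * 1024 * 1024):
    """Return a list of MD5 hex digests, one per fixed-size block of the file."""
    digests = []
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digests.append(hashlib.md5(block).hexdigest())
    return digests


def differing_blocks(good, suspect):
    """Return indexes of blocks whose digests differ between two passes."""
    changed = [i for i, (a, b) in enumerate(zip(good, suspect)) if a != b]
    # Blocks present in only one pass also count as differences.
    shorter, longer = sorted((len(good), len(suspect)))
    changed.extend(range(shorter, longer))
    return changed
```

Multiplying a returned index by the block size gives the approximate byte offset in the file where the two reads disagree.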

    Is there a BIOS update for the motherboard?


    Any other symptoms besides a single checksum error on an ISO file?
     
  6. Myrrh

    Myrrh MDL Expert

    Nov 26, 2008
    1,511
    627
    60
    The other symptom bit me last night. Got home and the machine was locked hard. The only thing that would get its attention was the power plug.

    While bringing it back up I took the opportunity to verify again that all the power management and "green" stuff is disabled, and disabled the "turbo boost" function of the Core i7 so it always runs at rated speed. My RAM was in fact already running at 1066 but I went ahead and bumped it down to 800.

    The same thing happened a few times months ago. This was before SP1 and before I got the new board, processor, ram and raid. It's been the same base OS and virtual machines the whole time though.
     
  7. 2centsworth

    2centsworth MDL Senior Member

    Feb 12, 2008
    333
    24
    10
    A hard lockup is not a good symptom, and it's definitely unacceptable for a server. Anything in the logs for a clue? Are you running any apps or services that may be to blame? What power supply does it have?
     
  8. thomas-007

    thomas-007 MDL Novice

    Feb 20, 2011
    11
    1
    0
    Did the GB Easytune app get installed? If so, I would uninstall it even if you do not run it.


    Did you load the Norton AV that comes with your board? If so, uninstall that as well.

    Maybe try downloading a CPU temperature program, one with a log for review.

    Maybe try reseating the CPU with some quality thermal compound instead of the stock flavor that came with the chip.

    What type of heat sink do you have? The factory ones, and even some of the aftermarket ones, are machined pretty poorly and require some lapping to be really efficient.

    One final thought: maybe underclock your CPU and see what happens.

    A good app to stress your CPU and memory is Prime95 (kind of like running SETI@home). If there are any hardware issues you will know within 15 to 30 minutes of 100% processing. Be sure to run a temperature app as well, so you can judge your cooling.
     
  9. Myrrh

    Myrrh MDL Expert

    Nov 26, 2008
    1,511
    627
    60
    Nope, none of the included crapware or trialware is loaded. It's a pristine 2008R2 with just the Hyper-V and File Server roles (and the hashcheck utility which alerted me to the transient "bad" files), nothing else.

    I am using the Intel boxed heat sink.

    All the suggestions so far have been good ones, along with one of my colleagues who suggested getting in touch with Adaptec as he has seen similar issues with some buggy drivers (I have the latest) on the higher end Adaptec controllers. I've not had the time to try any of this yet, just keeping an eye on it and rebooting as often as practical to keep it happy. Hopefully I can get to a proper diagnosis session with it this weekend.
     
  10. Myrrh

    Myrrh MDL Expert

    Nov 26, 2008
    1,511
    627
    60
    OK. Maybe I am finally getting somewhere with this. I had another hard lock last week, so I decided to yank three of the six memory modules.

    One of the VMs bluescreened during its first boot attempt, and then the whole machine hard locked again within a couple of hours, before the raid arrays had even finished verifying.

    I yanked those three chips and put back in the first set; it has been running flawlessly since then for about four days.

    If this pans out, guess I need to start diagnosing those three chips and see which is bad. I know this board will run with four slots populated (at least I remember seeing that in the book), not sure about five.
     
  11. zephxiii

    zephxiii MDL Novice

    Aug 5, 2009
    6
    1
    0
    #12 zephxiii, Apr 2, 2011
    Last edited: Apr 2, 2011
    Oy, should be running a system with ECC memory if integrity is your main concern :-/

    Either way, you NEED to run memtest86+ to see if any errors show. You should have started with all modules installed; if errors showed up, then test each module individually.