Hardware Canucks


frontier204 February 18, 2009 03:25 PM

Conquered my UNSTABLE_MACHINE after ~4 months :)
 
Hi all,

I just wanted to share my troubleshooting story, in case someone else is having a problem with their ATI card in F@H. I was not able to fold ATI GPU work units since late October because of the "nonzero force sum on GPU" error.
Yesterday, my weekly attempt to solve this error actually worked! The key seemed to be performing the following steps to revert to Catalyst 8.12 from 9.1:
1. Run the ATI Catalyst uninstaller and uninstall everything (including chipset drivers; I have an AMD rig). Restart.
2. Run Driver Sweeper on GPU drivers. Run CCleaner registry cleaner. Restart.
3. Install ATI Catalyst 8.12 chipset drivers (southbridge, then AHCI). Restart.
4. Install ATI Catalyst 8.12 GPU drivers (including Northbridge filter driver). Restart.
5. Copy all the amdcal*.dll from \syswow64\ to the FAH folder.
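Step 5 can be scripted so it is easy to repeat after each driver change. This is only a rough sketch: the function name `copy_cal_dlls` is made up for illustration, and the source and destination folders are assumptions you would need to adjust for your own install.

```python
import glob
import os
import shutil


def copy_cal_dlls(src_dir, dest_dir, pattern="amdcal*.dll"):
    """Copy every file in src_dir matching pattern into dest_dir.

    Returns the sorted list of file names that were copied.
    """
    copied = []
    for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
        # copy2 keeps timestamps, which helps when checking driver versions later
        shutil.copy2(path, dest_dir)
        copied.append(os.path.basename(path))
    return copied
```

For example, `copy_cal_dlls(r"C:\Windows\SysWOW64", r"E:\FAH")` would copy the CAL DLLs, assuming Windows lives in `C:\Windows` and the FAH folder on the USB key is `E:\FAH` (both paths are hypothetical).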

The rest of my "magic formula" is the following:

RIG and OS, other related software:
Phenom 9950 OC, 205 x 14 = 2.87 GHz, VID is 1.300V
Asus M3A78-T motherboard, modded with Pentium II CPU fan attached to northbridge
4 GB OCZ Fatal1ty RAM OC'd to 810 at 5-4-4-15, 2.1V
Windows Vista Business x64, SP1
Catalyst 8.12 drivers, for chipset, AHCI, IDE, and GPU
Running FAH off USB key
FAHGPU and 3x BOINC
CCC is active
CoolerMaster Elite 330 case modded with 4 fans: 1x 80mm intake, 1x 120mm intake, 1x 120mm exhaust, 1x 80mm exhaust
Seasonic M12 600W modular PSU

GPU:
Single Diamond Radeon 4850
- Stock single-slot fan at 60%, the label peeled off a few months ago due to heat hehehe
- Replaced all thermal compound with Arctic Silver 5 and Zerotherm compound (zerotherm for most chips, AS5 for VRMs because it is thicker)
- It seems I can leave the fan at its normal setting (which holds GPU temps at ~80C) and still complete WUs

FAH Client settings:
Console client (6.23)
Disabled CPU affinity lock
-forceasm
Priority higher than normal
UAC on, NO admin mode but XP compatibility mode is on
FAHCore_11 is allowed through Windows firewall (or else it would cause NANs error)
Copied all CAL DLLs from SYSWOW64: amdcalcl.dll, amdcaldd.dll, amdcalrt.dll.

From what I've experienced, if you see the "nonzero force sum on GPU" error coming up frequently on your FAH rig, DO NOT immediately suspect hardware instability. It's more likely than not an incompatibility with drivers or DirectX-hogging programs that you are also running. As a side note, when I intentionally OC'd my GPU to become unstable, I actually got the "NANs detected on GPU" error.
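Since the two error messages seem to point to different causes, it may help to count how often each one shows up in the client log. A minimal sketch, assuming the messages appear in the log as plain text; the marker strings are taken from this post, and the helper name `count_errors` is made up for illustration.

```python
from collections import Counter

# Error strings quoted in this thread; the real log lines may carry extra text.
ERROR_MARKERS = ("nonzero force sum", "NANs detected on GPU", "UNSTABLE_MACHINE")


def count_errors(log_text):
    """Count log lines containing each marker (case-insensitive match)."""
    counts = Counter()
    for line in log_text.splitlines():
        lowered = line.lower()
        for marker in ERROR_MARKERS:
            if marker.lower() in lowered:
                counts[marker] += 1
    return counts
```

To use it, read the client's log file into a string and pass it in, for example `count_errors(open("FAHlog.txt").read())`, assuming the log is named `FAHlog.txt` in the FAH folder.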

Hope this helps anyone else who is trying to troubleshoot GPU folding. I've had 100% success since 10:00PM EST yesterday. As an aside, I really hope Stanford can stop throwing code into their FAH cores at random, so debugging FAH will be a well-defined process rather than "black magic".

I'm afraid to restart my computer because the UNSTABLE_MACHINE might come back, but now Windows Update is bugging me to restart :shok:

EDIT: Revised my formula because it started EUEing again; maybe it just wanted me to poke the settings

frontier204 February 18, 2009 03:26 PM

I'll be running some experiments to see if any of the following break FAH GPU:

Using all 4 cores with other BOINC / FAH CPU? - no problem
Move folding client back to HDD? - breaks FAH (I always EUE when I'm on the HDD)
Re-enable Aero?
Re-enable HP Printer drivers?
Re-enable sidebar (running All CPU Meter and RSS feed gadgets)? <Probably causing instability, got nonzero force sum>
Can a Systray be used? - haven't gotten a Systray to complete a WU yet

-------------------------
Other failures I'm seeing
-------------------------
Instant-failure NANs detected on GPU:
Saw this from a 4744 WU. I'm going to try running in admin mode with no Windows XP compatibility flag and see if the next 4744 will complete.

3.0charlie February 18, 2009 03:45 PM

Aero uses some GPU power, and may cause instability.

LCB001 February 18, 2009 08:25 PM

Great job of troubleshooting, I'm sure it will help others getting the Unstable Machine error. I would never have thought to run F@H off of a USB stick, it will be interesting to see what happens when you go back to running on the HDD.

I'm very glad I don't get the problems others have had to contend with, hopefully Stanford's next core release will be a little more stable...

frontier204 February 19, 2009 03:38 PM

Thanks, I got the USB stick idea from the way that Vista loves locking down HDDs but gives you free rein over flash. (try to rename an HDD partition and you'll see what I'm talking about)

Sadly I just got a series of those instant failure "NANs detected on GPU" due to a 4744 work unit, and after I restarted the computer I got another nonzero force sum. There are 3 variables in play there, sidebar, HDD, and running with admin mode, so I'll try to isolate each one.
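One way to isolate the three variables is to start from a baseline and flip one setting per test run. A small sketch of that plan; the baseline values (sidebar off, USB key, no admin mode) are assumptions based on the settings listed in the first post, and `one_at_a_time` is a made-up helper name.

```python
def one_at_a_time(baseline, alternatives):
    """Return the baseline run plus one run per variable with only that variable changed."""
    runs = [dict(baseline)]
    for name, value in alternatives.items():
        trial = dict(baseline)
        trial[name] = value  # change exactly one setting
        runs.append(trial)
    return runs


# Assumed baseline and the alternative value to try for each variable.
baseline = {"sidebar": "off", "storage": "usb", "admin_mode": "off"}
alternatives = {"sidebar": "on", "storage": "hdd", "admin_mode": "on"}

for run in one_at_a_time(baseline, alternatives):
    print(run)
```

That gives four runs instead of the eight needed to try every combination, at the cost of missing failures that only show up when two settings interact.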

frontier204 March 1, 2009 05:36 AM

...well that was short-lived...
 
Back to impractically high EUE rates on my GPU, to the point where running a single cpu client can out-PPD my GPU. I'm beginning to suspect hardware instability because my IGP was stable (although not fast enough to justify using it).

I won't buy another GPU because it seems the GPGPU "gods" hate me for the time being. My Radeon 4850 has its good and bad days, while my nVidias (8400M GS, 8500 GT) EUE'd on the first WU attempt I tried with them (not to mention they're too slow to be practical).

I'll OC my CPU more in an attempt to compensate, since SMP client has never failed on me yet :thumb:

Alwaysrun March 1, 2009 08:59 AM

Frontier, you're a real folding trooper dude. Awesome troubleshooting and some great advice. I too had trouble with my GPU recently and tried so many things to rectify the situation. It wasn't until I tore down my rig, changed the PCIe power cables, and did a complete wipe of my HD and a fresh install of Vista 64-bit that I got rid of those damn EUEs.

I don't use Aero, the sidebar, or auto-hide on the taskbar, all of which I've read muck with the video card's performance to one degree or another. When my Windows loads it's as naked as a newborn: the network connection and my sound volume are all that load. Absolutely no startup programs auto-load, and I've gone through every single process that runs. I was really surprised just how many unneeded ones I could get rid of that Windows crams down your throat straight off the bat. I even went so far as not installing my printer, USB camera, and joystick. This baby runs smooth as silk now, and for the life of me I can't pinpoint which thing I did or didn't do to stop the NANs detected on GPU, but it's finally stopped! thank gawd. I really loved your flash stick idea, I just wish I had one myself.

I've pretty much given up on the Stanford folding forums, as the knee-jerk reaction from most of the experienced folders there is that it's always a bad video card, and they doggedly insist it's a hardware malfunction. I reject their assumptions, keeping in mind that the SMP and GPU2 programs are both poorly written, have rudimentary option sets, and have been in beta for 3 years...3 years! omfg there must be one capable programmer at Stanford or among the 1.2 million folders in the world who can write a god damn FINAL release.

Anywho... Enough rambling from me hehe. 4 months and you're still sticking with it, and I salute you frontier, you're an inspiration!! :thumb:

Shadowmeph March 1, 2009 11:33 AM

Good detective work.
Some day I would like to build a machine for folding, and I am curious why you are using Windows Vista Business x64, SP1. Doesn't Vista use more resources? I would imagine that the Business version would use even more. Don't get me wrong, I am not saying that there is anything wrong with your operating system of choice, I am just wondering why Vista.

frontier204 March 1, 2009 02:58 PM

I use Vista Business x64 because I got it for free :P (MSDNAA) Then again, MSDNAA also gives me Windows XP Pro / Server 2003, but I don't like to limit my GPU to DirectX 9 mode if I use the computer to game.

Back when I was folding on my Dell with the same GPU, I used Vista Home Premium x64 before and Windows XP MCE, but I noticed that the performance was pretty much the same - the ATI FAHCore11.exe gobbles up an entire core whether I am running Vista or XP. The nVidia core is a different beast, and at least folding with a low-end card, it barely used any CPU.

Thanks all for the compliments! I'm still not sure why I always try to troubleshoot this F@H program each time a new driver or FAHCore announcement comes out. I guess I still want to squeeze all the performance out of my machine for folding / BOINC / whatever, but I refuse to buy parts specifically for F@H. So I'm stuck poking and prodding until what I have works ;)


All times are GMT -7.