Forum Replies Created
June 18, 2008 at 7:47 pm in reply to: Moving Volumes to a Newly Added eSATA array/Expanding System Volume #373182
apr400
Participant
I hadn’t thought of ASR – thanks.
June 18, 2008 at 9:34 am in reply to: Moving Volumes to a Newly Added eSATA array/Expanding System Volume #373168
apr400
Participant
Hi All,
Lots of views, but no responses – anyone care to venture an opinion as to whether my scheme will work?
Many thanks
apr400
Participant
An update on our XServe –
Having had the dreaded restart issue, it settled down and worked between May and December without a problem. The restart issue reappeared at the beginning of the month and increased in frequency until yesterday the PSU died. Not sure if that was a symptom or a cause of all our problems, but it’s interesting to note that Applecare tell me there is at least a 4 week waiting list for G5 power supplies – maybe lots of G5 Xserve PSUs are failing at the moment.
Also, when I removed the PSU I discovered that the logic board had been incorrectly installed in the factory, so that it was bent up over one of its location pins – can’t have helped!
@Aftercare – the PMU 122 error simply means the machine lost power instantly. I’ve seen it suggested that 122 is stored as the default error, so it’s always there if something happens that prevents the machine from writing an error out. (http://www.gibbilicious.com/gibbilicious/2006/02/troubleshooting_the_mysterious.html)
Anyway – off to fend off users who want their email (a month for a server PSU – truly ridiculous)
apr400
Participant
Update:
I did have the problem tracked down to copying lots of data to an external firewire drive. Every time I did this the machine fell over an hour and a half later and required a PMU reset to get out of the restart-every-five-minutes cycle.
However, it has for the moment stopped doing this. I have been having some chats with various Applecare people, who thought initially that it might be a bad logic board (bad nvram possibly, as indicated by the fact that auto-restart settings weren’t being followed, ie it restarted when it was set not to). Now it seems more likely that there was some sort of loose connection and that opening the case may have reseated something. The machine has now been up for 10 days without a problem, albeit under fairly light load.
For anyone that wants it I have a script I wrote that maintains a rolling log of a selectable number of processes using top. You can specify how much history to retain, how often to sample, how to sort the processes (CPU, memory etc). It proved quite helpful in convincing me that I didn’t have a runaway process, and might help also if you do, especially if you’re not seeing anything unusual in the logs. Email me via my profile if you want it.
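For readers who just want the general idea, here is a minimal sketch of a rolling process log. It is not the script mentioned above: the function names (`trim_log`, `snapshot`, `rolling_log`) and the `=== ` snapshot marker are made up for illustration, and it uses `ps` piped through `sort` instead of `top`, because `top`'s batch-mode flags differ between systems.

```shell
#!/bin/sh
# Hypothetical sketch of a rolling process log (not the original script).

# Keep only the newest $2 snapshots in log file $1.
# A snapshot is everything from one "=== " line up to the next.
trim_log() {
    log=$1
    keep=$2
    [ -f "$log" ] || return 0
    total=$(grep -c '^=== ' "$log" || true)
    if [ "$total" -le "$keep" ]; then
        return 0
    fi
    drop=$((total - keep))
    # Skip lines until more than $drop snapshot markers have been seen.
    awk -v drop="$drop" '/^=== /{n++} n>drop' "$log" > "$log.tmp" &&
        mv "$log.tmp" "$log"
}

# Print a timestamp marker plus the top $1 processes,
# sorted on column $2 (2 = %CPU, 3 = %MEM).
snapshot() {
    count=$1
    sort_col=$2
    echo "=== $(date '+%Y-%m-%d %H:%M:%S')"
    ps -A -o pid=,pcpu=,pmem=,comm= |
        sort -k "$sort_col,$sort_col" -nr |
        head -n "$count"
}

# Append a snapshot every $3 seconds, keeping the last $2 snapshots in $1.
rolling_log() {
    log=$1
    keep=$2
    interval=$3
    count=$4
    sort_col=$5
    while :; do
        snapshot "$count" "$sort_col" >> "$log"
        trim_log "$log" "$keep"
        sleep "$interval"
    done
}
```

Something like `rolling_log /tmp/proc.log 100 60 10 2` (the path is only an example) would sample once a minute, keep the last 100 samples, and record the ten busiest processes by CPU each time. After a crash, the tail of the file shows what was running just before the machine went down.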
Anyway for the moment everything is working and I am going to give the machine a bit of stress testing before I try to deploy again.
apr400
Participant
[QUOTE][u]Quote by: apr400[/u]
I did discover from the logs that Retrospect was doing something, which surprised me as I had turned it off. I have now uninstalled it. I also discovered a bad script (homebrew) in the crontab, which I corrected, and found that the firmware of one of the LaCie firewire devices was 1.05 instead of the latest 1.07, so I updated that.
Anyway, going to leave it on again tonight and see what happens.
[/QUOTE]
Well, that didn’t help unfortunately. This time it fell over within a couple of hours. Same symptoms.
So next test scenario – no firewire devices attached, and moved to a UPS. Maybe it will be up when I get back in the morning!
Alex
apr400
Participant
Well the PMU voodoo didn’t stick. The server ran til around five in the morning and then went back into the restart-every-five-minutes cycle.
I managed to get the xrdiags working (followed the comments in the macosxhints forum posting by javaist, which has a slightly different method of setting up the paths) and got a clean bill of health.
I did discover from the logs that Retrospect was doing something, which surprised me as I had turned it off. I have now uninstalled it. I also discovered a bad script (homebrew) in the crontab, which I corrected, and found that the firmware of one of the LaCie firewire devices was 1.05 instead of the latest 1.07, so I updated that.
Anyway, going to leave it on again tonight and see what happens.
One other odd thing from the logs: sometimes my machine is localhost, and sometimes nprl (its actual name), ie it goes for long periods as one and then long periods as the other. Is that normal?
ttfn
Alex
apr400
Participant
I have a server crashing with similar symptoms to the above, but at a much greater rate.
Thought I’d add my experiences as there doesn’t seem to be all that much about this on the web, plus a few queries to other people who’ve had this problem.
I have an XServe that I am configuring to replace a linux server (machine environment and configuration at the bottom of this post). Everything was going well and I had started moving users’ mail across from the old server when, after about 15 successful migrations, the XServe started to crash. The machine simply stopped, and then restarted itself. After this it went into a cycle where every five minutes or so it would switch off and then restart. During the auto reboot it seemed to struggle a bit, powering the blowers up and down and flashing the System Warning light for a minute before stopping completely for a minute and then starting up as normal.
(Is it supposed to do that? I mean the struggling-to-start business rather than the crashing! This is my first XServe.)
After rebooting I looked in the logs – nothing useful before the crash (with the exception that just prior to the SECOND crash the last thing written was a few lines of binary, which is odd). The only helpful log item is a PMU FORCED SHUTDOWN, CAUSE = -122 after the restart. The power is supplied from a surge protected socket (not UPS), and several other machines run off the same circuit without trouble. I checked the power cable, and also tried a different circuit without joy – the server continued to reset after five minutes of uptime.
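For anyone checking their own machine for the same entry, a small helper along these lines can pull the PMU shutdown lines out of a log and count them. This is only a sketch: the function name `pmu_shutdowns` is made up, and the default path `/var/log/system.log` is an assumption, so pass whichever log file actually holds the message on your system.

```shell
#!/bin/sh
# Hypothetical helper: list PMU forced-shutdown entries in a log file
# and print how many there were. $1 is the log file to search; the
# default path is an assumption, not something confirmed in the post.
pmu_shutdowns() {
    log=${1:-/var/log/system.log}
    # Show the matching lines (no output if there are none).
    grep 'PMU FORCED SHUTDOWN' "$log" || true
    # Then a one-line total.
    echo "total: $(grep -c 'PMU FORCED SHUTDOWN' "$log" || true)"
}
```

Running `pmu_shutdowns` after each restart gives a quick view of how often the machine has been dropping out, and the timestamps can be lined up against whatever else was logged just before.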
I also watched Activity Monitor to keep an eye out for runaway CPU or memory usage – nothing going on there. Server Monitor also showed nothing abnormal going on.
Checked crontab and cron – nothing running at that time.
So I started disabling services, basing my choices on the last thing to write to the system log.
Shut down DiskSpaceMonitor – still crashed
Shut down DSM and watchdog – still crashed
Shut down DSM, WD and NTPD (checking apple time server) – still crashed
At this point I got a bit irritated with the thing, and so after startup I shut down everything: all of the above, plus everything in Server Manager, plus slapd.
And the server stayed up. After an hour and a half, I started bringing things back up. Having started all of the services that were previously running, the server stayed up for another hour or so without trouble. I then logged in to an ibook authenticating via LDAP, and mounting a (default, ie empty) home from the server via AFP. Based on what I had read in various forums I was keeping a very close eye on the cpu usage. However the moment login began (ie as I hit enter on the ibook) the server shut down immediately (ie no time for watchdog to have been prevented from running, and no time for anything to be written to the log other than the first three steps of the Kerberos auth, up to localhost krb5kdc[341]: TGS_REQ…).
After auto restart I left services on, but didn’t connect the ibook, and it shut down at five minutes as before. So somehow having the services off for a while and then bringing them back up allowed the server to run until the ibook log in.
Anyway at this stage I pulled all of the leads on the Xserve and opened it up. I checked the battery – 3.7 volts. I reset the PMU, and then reattached everything bar the Firewire. Booted up again – ran normally for an hour. Logged in the ibook – no problems. Called it an evening at that point (what with the configuration and the migration and the crashing I’d been at my desk for 44 hours at that point!)
Came in this morning and ran the server with no problems. At lunch I shut it down and reattached the firewire drives. Restarted – no problems so after an hour auth’d the ibook, and still going strong – uptime now 3 hours. I am a bit worried that there is not much loading on the server yet – difficult to replicate exactly the transfer of mail that started all this, as I have had to redeploy the old server back to the main net for the moment. I want to retry the migration this coming weekend, but am having worries re the XServe’s reliability at this point.
If you have had this problem and managed to sort it via the PMU voodoo – does the problem reoccur? My nightmare involves moving umpteen gigabytes of user files and mail to the new server and then having to move it all back in five minute steps (especially given the format changes mean I can’t easily just pull the Xserve disks and pop them in the linux server).
Going through various forums I have not found much on this situation other than this thread. Lots of people seem to have the crashes, lots of solutions are suggested, but not much is reported back re success. Also I have not heard of anyone crashing with quite my frequency. I have seen suggestions that it is due to the firewire disappearance issue as well, although at the moment my drives seem to be playing nice and they weren’t mounted during any of the problems (for what that’s worth), and I would like to avoid losing my local backup capability. (I use the disk to stage nightly backups before transfer to offsite NAS)
Also, just to be safe I would love to run Apple Hardware Test, but I only have one Mac OS X server, and I’ll be buggered if I can get Javaist’s tip on the bootable xrdiag cd to work – tried both methods and just don’t seem to be able to boot off the resulting disks.
(One thing of note – my system chokes on the suggested
sudo bless --folder path
but does accept
sudo bless -folder "path"
would that make any difference?)
Anyone got any suggestions on that?
Anyway thanks for reading this far through a long post. Config is below.
______System Configuration_______
Xserve G5, 1 CPU 2GHz, 1GB Memory
Some History on the server.
The Server is one and a half years old, but for various reasons only brought into service about 2 weeks ago so it’s basically a new machine (albeit without a warranty unfortunately). All updates to 10.3.9 were applied before configuration. In order to configure the server we have a small private network with a NAT to the main LAN. This allows the XServe to look up its DNS on the main company DNS whilst having the same name and IP as the old server, and whilst keeping the old server running in the main LAN. DNS is rock solid and we have had none of the normal DNS LDAP issues.
The server has a VGA card. The monitor, keyboard and mouse are by KVM, with a local and remote option (Adderview OSD with AdderLink X-Silver). There is an additional local Mac style keyboard attached by USB. There are two firewire devices attached – a tape backup drive (LaCie d2) and a Disk (LaCie d2). Neither is left mounted during normal operation – they are for backups. There are also two ethernet cables, although only one NIC is on at the moment. On the net with the server – a linux server, an ibook (10.3.9), a Windows (XP) machine and a printer. Power is through a surge protected socket.
The server is running Open Directory (as an OD master), AFP, Firewall, Mail, and Windows Services (as PDC). User home directories are not on the startup volume, but on a RAIDed Home Volume (Xserve disks 2 and 3). User mailboxes are on a Mailboxes partition (same disk, different volume to Startup). Users have AFP and SMB access (SMB by virtual shares). Until the crashes everything was running smoothly after configuration, with various Windows and Mac machines able to connect via SSL versions of services, Kerberos and Password auth all working etc etc.