Corrupted boot archive on Solaris X86

We have an old rack server that runs a version of Solaris 10 for X86. For the most part, this machine never gives any trouble and we very rarely need to reboot it. However, in recent months it has sometimes failed to come back up after a reboot, stopping with a system console error that looks something like:

cannot read biosint
trap type 13 (0xd) err code 0x98b8 eip=0x0
...lots of register and stack trace data...
panic: corrupted boot archive . . . boot loader
Press any key to reboot

We found the instructions detailed here resolved this issue for us (although the error message is not exactly the same).

Installing Nagios on Solaris

If you are not familiar with it, Nagios is an open source system and network monitoring application that keeps a watchful eye on hosts and services that you specify, alerting you when things go bad and when they get better. It was designed to run under Linux but, according to the creators, should also run on other Unix variants. With this in mind, I decided to try it out on Solaris.

Fortunately, they provide a very comprehensive User Guide and trust me, you’re going to need it. There is a section early in this document entitled Advice for Beginners which is pretty blunt about how tricky Nagios is to set up and, believe me, they are not wrong. Having said that, they do also say that once you get it running you will never want to be without it, and I can definitely subscribe to this notion too.
Anyway, here are my notes from installing Nagios on Solaris:

My Setup

  • Solaris 10 (u3) for x86
  • Sun Studio 12
  • Nagios 2.10
  • Nagios Plugins 1.4.11

Building Nagios

I downloaded the Nagios source tarball and unpacked it as a non-root user. Then, following the User Guide, I ran the configure script with no arguments (implying I wanted all default settings) followed by make all, and this seemed to work fine.

Building Nagios Plugins

Nagios does most things by using other scripts/applications which it calls plugins. The Nagios website provides a collection of popular plugins for you to build. Once again, I did so using a configure command followed by a make all command. However, this did not go entirely smoothly:

  1. A number of the plugins failed to build citing an “undefined symbol: floor” error. This was resolved by adding -lm to the LIBS defined in line 328 of nagios-plugins-1.4.11/plugins/Makefile. This could probably also have been fixed by adding $(MATHLIBS) to the link statement of each affected plugin but that would have been more work.
  2. The check_dhcp module failed to compile citing several unknown data types (u_int8_t and u_int32_t). This was resolved by adding -D__solaris__ to the CPPFLAGS definition at line 161 of nagios-plugins-1.4.11/plugins-root/Makefile.
  3. The nagios-plugins-1.4.11/plugins-root/Makefile was also missing the same -lm link parameter as the plugins Makefile (line 221)

Once all of the above changes were made, all of the plugins seemed to build correctly.
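
The Makefile edits above can also be scripted rather than made by hand. The following is only a minimal sketch: the helper name append_to_make_var is made up for illustration, and it assumes each variable is defined on a single line of the form "NAME = value", which may not match every generated Makefile. It writes to a temporary file because the stock Solaris sed has no in-place option.

```shell
# Hypothetical helper: append a value to a "VAR = ..." line in a Makefile.
# Assumes the variable is defined on one line; keeps a .orig backup.
append_to_make_var() {
  makefile="$1"
  var="$2"
  value="$3"

  cp "$makefile" "$makefile.orig" || return 1

  # Capture the whole "VAR = ..." line and re-emit it with the value appended.
  sed "s|^\($var *=.*\)\$|\1 $value|" "$makefile.orig" > "$makefile.new" &&
    mv "$makefile.new" "$makefile"
}
```

Run from the top of the plugins source tree, it would be called as append_to_make_var plugins/Makefile LIBS -lm and append_to_make_var plugins-root/Makefile CPPFLAGS -D__solaris__. Bear in mind that re-running configure regenerates the Makefiles and discards these edits.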

Problems Found During Nagios Configuration

1. check_ping plugin did not work

No matter what way I configured the use of the check_ping plugin (in localhost.cfg, services section) it always reported:

CRITICAL – You need more args!!!
Could not open pipe:

A number of websites suggested that this was an IPv6 issue and that I should have used --with-ipv6=no in my original call to the configure script when building the plugins. However, this was not the solution for me. It turns out that the definition of PING_COMMAND in nagios-plugins-1.4.11/config.h was empty and thus the check_ping plugin was actually making no attempt whatsoever to ping the requested host. I suspect that the reason for this is that I built the software as a non-root user which, on Solaris, does not have the ping command in its path (since ping is located in /usr/sbin on Solaris). Hence, the original configure script was unable to produce a valid definition for PING_COMMAND.

The solution to this was to edit nagios-plugins-1.4.11/config.h and add the following definition for PING_COMMAND (line 796):

#define PING_COMMAND "/usr/sbin/ping -s %s 64 %u"

The above command is specific to Solaris and makes the Solaris version of ping behave like the Linux ping command. After this edit, I had to force a rebuild of the check_ping plugin (touch plugins/check_ping.c; make).
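
A quick way to tell whether a build is affected is to look at config.h before compiling. The sketch below is illustrative only: both function names are made up, and the second one simply fills in the format string from the definition above on the assumption that the plugin substitutes the host name for %s and the packet count for %u.

```shell
# Hypothetical check: succeeds only if config.h defines a non-empty PING_COMMAND.
ping_command_defined() {
  config_h="$1"
  grep '^#define PING_COMMAND "[^"][^"]*"' "$config_h" >/dev/null
}

# Show what the Solaris format string looks like once host and count are filled in.
# Assumption: %s is the host and %u is the number of packets to send.
expand_ping_command() {
  host="$1"
  count="$2"
  printf '/usr/sbin/ping -s %s 64 %u\n' "$host" "$count"
}
```

For example, expand_ping_command localhost 5 prints /usr/sbin/ping -s localhost 64 5, which can be run by hand to confirm that ping behaves as expected. If my suspicion about the cause is right, adding /usr/sbin to PATH before running configure might avoid the problem altogether, but I have not verified that.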

2. statusmap.cgi did not build

I only noticed this when I tried to view the Status Map section of Nagios. In short, statusmap.cgi had not been built because I was missing the GD library on my Solaris system. The solution was to download and install a version of the GD library (and each of the packages it depends on). I got mine from sunfreeware.com. The statusmap.cgi utility then built correctly and once I copied it to the libexec directory where Nagios was installed, it worked.

3. VRML Browser Plugin required

When I tried to view the 3-D Status Map options in Nagios, my browser kept launching a “Save As” dialog box. It turns out I needed to install a VRML plugin in my browser. I chose one called Cortona from Parallel Graphics. It seems to work fine in Firefox although, as yet, the 3-D Status Map view is more impressive than it is useful (for me anyway).

Conclusion

Nagios indeed took a long time to install, configure and set up. However, I can confirm that it was worth the effort and I am very pleased with it so far.

Corrupted Boot Archive after Solaris X86 patch update

I’ve installed a number of Solaris 10 X86 (U3) systems recently and hit a very annoying issue on each one of them: the system fails to boot after installing the latest applicable patches. Immediately after the GRUB boot menu times out and it attempts to boot Solaris, it returns with a “corrupted boot_archive. No boot device available” message. No other information is presented.

Here is how I recovered from this situation:

  1. Boot the system in Failsafe mode
  2. The system will detect your Solaris boot partition and offer to mount it on /a. Select Yes when asked about this.
  3. Once the system completes its Failsafe boot, go to /a/platform/i86pc and remove the file called boot_archive.
  4. Reboot the system using the “reboot” command, whereby the system appears to regenerate the file you just deleted.
  5. The system should then boot normally again
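
Steps 3 and 4 above can be sketched as a small shell function to run from the Failsafe shell. This is only a sketch: the function name is made up, and it takes the mount point as an argument (defaulting to /a) so that it can be tried against a scratch directory first.

```shell
# Hypothetical helper: remove the boot archive under an alternate root.
# The default root of /a matches the Failsafe mount point in the steps above.
remove_boot_archive() {
  root="${1:-/a}"
  archive="$root/platform/i86pc/boot_archive"

  if [ -f "$archive" ]; then
    rm "$archive" && echo "removed $archive"
  else
    echo "no boot archive found at $archive" >&2
    return 1
  fi
}
```

After remove_boot_archive reports success, run reboot as in step 4. Solaris 10 also provides bootadm update-archive -R /a for rebuilding the archive on an alternate root, although that is not what the steps above use.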

After installation and registration of a fresh Solaris system, I usually run the smpatch update command at least once to bring the system to a reasonable patch level (before installing any other software on it). I realise that this may not be entirely advisable in a live environment but on a fresh install, I feel it should be a reasonable thing to do. After all, the man pages for the smpatch command state (for the update subcommand):

This subcommand analyzes the system, then downloads the appropriate updates from the Sun update server to your system. After the availability of the updates has been confirmed, the updates are applied based on the update policy. …If an update does not meet the policy for applying updates, the update is not applied.

I have used this technique several times on SPARC-based systems without issue. The problem only appears to happen on X86 installations.

Installing Solaris 10 x86 in VMware

I installed Solaris 10 x86 in a VMware Virtual Machine on a laptop earlier this week. It was mostly a straightforward process as I’d used VMware before to install Ubuntu Linux on a number of desktop systems. However, I did run into some trouble on the networking front.

The Problem

Whenever I generated a large amount of network traffic (e.g. copying a 300MB file onto the VM), the network driver (pcn0) seemed to fall over and die, rendering the VM unreachable from the outside world. I was using the SSH copy tool (scp) to carry out the file copy and the problem manifested itself by causing scp to report that the copy was stalled, a state from which it never returned. When I investigated from the system console, the pcn0 interface no longer had an IP address (but was still up). I had to reboot the VM to recover from this.

The Solution

In the end, the solution was to install VMware Tools which actually installs a different network driver (vmxnet) in place of the pcn driver. After VMware Tools was installed I did have to manually rename some of the networking files in /etc (hostname.pcn0 to hostname.vmxnet0 and dhcp.pcn0 to dhcp.vmxnet0) to get the system back on the network. But once I did that (and rebooted), everything worked fine and I haven’t had any problems since.
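
The renaming step can be sketched as follows. The function name is made up for illustration, and the directory is passed as an argument (defaulting to /etc) so the function can be tested against a copy before touching the real files.

```shell
# Hypothetical helper: move the pcn0 interface files over to vmxnet0.
# Only files that actually exist are renamed; anything else is left alone.
rename_nic_files() {
  etc_dir="${1:-/etc}"

  for prefix in hostname dhcp; do
    if [ -f "$etc_dir/$prefix.pcn0" ]; then
      mv "$etc_dir/$prefix.pcn0" "$etc_dir/$prefix.vmxnet0"
    fi
  done
}
```

Run as root with no argument, it acts on /etc; a reboot is still needed afterwards for the new interface to come up.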
If you happen to have a DVD/CD mounted in the VM (either physically or via ISO image), you should unmount it before attempting the VMware Tools installation as this process tries to mount an ISO image as part of the installation. If you fail to do this, the VMware Tools installation process will pretty much just sit there and give you no feedback as to what’s happening. Despite this, I am still a big fan of VMware and of Solaris.

I used Solaris 10 U3 (10/06) and VMware Server for Windows 1.0.3 Build 44356.