
[Testbed-admins] SOLVED: Generic Boot Image Weirdness



After much wailing and gnashing of teeth troubleshooting this off the list, we found the problem. I wanted to post the solution back to the list.

Isn't it always something stupid when you finally figure it out...?

It turned out that it was a problem with the imagezip process, which led to a corrupted image getting installed on my test nodes via the frisbee boot loader.

The image customization procedure has the imagezip output stored on /proj, which is actually an NFS mount from ops. Whoever set up those NFS mounts at this end made them mighty small :) I did mention that I inherited this project, right?

All that was happening was that imagezip was running out of space to store the image on /proj, at which point it crapped out leaving a corrupt/truncated image on boss. All the kernel messages about a full file system were on the ops console, which was why I didn't see them. Ops is on a different KVM port and I rarely go there. Sigh.
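In hindsight, a quick free-space check on the target directory before imaging would have flagged this right away. Here is a minimal sketch of such a check. The helper names free_kb and has_room are made up for illustration and are not part of the testbed software, and the threshold you pass in is whatever you guess the image will need.

```shell
#!/bin/sh
# Sketch only: free_kb and has_room are hypothetical helpers,
# not part of imagezip or the testbed tools.

# Print the free space, in 1 KB blocks, on the file system holding directory $1.
# "df -P" forces the one-line-per-file-system POSIX format, so the
# "Available" column is always field 4 of the second line.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if directory $1 has at least $2 KB free.
has_room() {
    avail=$(free_kb "$1") || return 1
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

For example, `has_room /proj 1000000 || echo "not enough space on /proj" >&2` would complain before imagezip ever starts if less than roughly 1 GB is free.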

The key bit of evidence was a one-line error message returned by the imagezip process, which I should have paid closer attention to!

From boss I ran imagezip:

> # ssh pc6 imagezip /dev/$DSK - > $FULL-STD.ndz
>   Slice 4 is unused, NOT SAVING.
> select: Bad file descriptor

The "bad file descriptor" message came from ssh and was caused by the full file system. I moved to a file system with a lot more space to store the imagezip image(s) and everything worked like a champ.

> # ssh pc6 imagezip /dev/$DSK - > $FULL-STD.ndz
>   Slice 4 is unused, NOT SAVING.
> 6432201728 input (2471217152 compressed) bytes in 245.273 seconds
> Image size: 695205888 bytes
> 9.609MB/second compressed
> Finished in 247.239 seconds
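One way to avoid leaving a truncated image around that looks valid is to write to a temporary name and rename it only when the capture command exits cleanly. The sketch below assumes the failing command returns a non-zero exit status, which I did not verify for the ssh failure above. capture_to is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Sketch only: capture_to is a hypothetical helper.
# Usage: capture_to OUT CMD [ARGS...]
# Runs CMD with stdout redirected to OUT.partial and renames it to OUT
# only if CMD exits with status 0. On failure the partial file is
# removed and CMD's exit status is returned.
capture_to() {
    out=$1
    shift
    if "$@" > "$out.partial"; then
        mv "$out.partial" "$out"
    else
        status=$?
        rm -f "$out.partial"
        echo "capture failed (exit $status); no image written to $out" >&2
        return "$status"
    fi
}
```

With this in place the imaging step would look like `capture_to "$FULL-STD.ndz" ssh pc6 imagezip /dev/$DSK -`. It does not replace reading the error output, but a failed run then leaves no .ndz file behind to be installed by mistake.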

Thanks for all the help!

On 6/24/2010 5:28 PM, Barry Trent wrote:
Summary:

I'm unable to boot generic FreeBSD6.2 on most (but not all!) of my
testbed machines. The /usr partition is shown as corrupt at boot time:

/dev/ad0s1f: /dev/ad0s1f: BAD SUPERBLOCK VALUE: VALUES IN SUPERBLOCK
DISAGREE WITH THOSE IN FIRST ALTERNATE

More Details:

I customized generic BSD6.2 and Fedora Core 6 images as described in the
install instructions, making the default combined-image file
FBSD62+FC6-GENERIC.ndz.

When I nfree my nodes out of hwdown and boot them, they come up in
frisbee and download/install the combined generic image onto their hard
drive, reboot and go to the prompt where they wait for further boot
instructions. So far so good.

The image itself is apparently good, because on ONE of my machines I can
enter Ctrl-C to go into interactive mode and boot "part:1" (the BSD6.2
image) successfully.

Unfortunately, on the other 8 machines I have tried so far, this same
procedure leads to the following:
-----
/dev/ad0s1a: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/ad0s1a: clean, 25786 free (1138 frags, 3081 blocks, 1.8%
fragmentation)
/dev/ad0s1f: /dev/ad0s1f: BAD SUPERBLOCK VALUE: VALUES IN SUPERBLOCK
DISAGREE WITH THOSE IN FIRST ALTERNATE
/dev/ad0s1f: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
/dev/ad0s1f: CANNOT WRITE BLK: 12000
/dev/ad0s1f: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
/dev/ad0s1e: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/ad0s1e: clean, 120395 free (251 frags, 15018 blocks, 0.2%
fragmentation)
THE FOLLOWING FILE SYSTEM HAD AN UNEXPECTED INCONSISTENCY:
ufs: /dev/ad0s1f (/usr)
Automatic file system check failed; help!
Jun 24 16:26:18 init: /bin/sh on /etc/rc terminated abnormally, going to
single user mode
Enter full pathname of shell or RETURN for /bin/sh:
-----

So apparently the image is somehow getting corrupted on almost all my
machines during the frisbee download process or during the later
"part:1" boot?

The machines are all identical (I think a few may have a slightly
different BIOS revision). I disassembled two machines (the one that
works and one that doesn't) and compared the hard drives. Identical part
numbers and firmware revs. I swapped the drives between the machines and
the problem followed the drive, not the chassis.

Strangest of all, I used nalloc/nfree to force a re-install of the image
by frisbee on these two machines, and the problem STILL followed the
hard drive!

WTF? Is it possible that there is some residue on the drives that's
causing me trouble? Do I need to wipe the drives before using them?

Any suggestions? I'm at a loss...again.