[Testbed-admins] Nodes Stuck in reloading

Several nodes are stuck in reloading.

I believe this started after a power outage, but from what I can tell my systems were not affected: they are all on UPS, and they were all powered on the next day.

The issue is that my nodes are getting stuck in reloading mode, and they report either yellow (possibly down) or red (down).

I discovered this problem when I tried to power on a few systems that I had turned off over the weekend to conserve power. I put them into HWDOWN before powering them off. When I powered them back on and tried to free them from HWDOWN, they got stuck in reloading.

I connected to the nodes' consoles while they were reloading, and this is where the output ends:

nfs_diskless: no server

Trying to mount root from ufs:/dev/md0c

Pre-seeding PRNG: kickstart.

Loading configuration files.

Entropy harvesting: interrupts ethernet point_to_point kickstart.

Fast boot: skipping disk checks.

Setting hostname: .

Emulab looking for control net among: bce0 em0 em1 em2 em3 bce1 ...

em3: link state changed to DOWN

em2: link state changed to DOWN

em1: link state changed to DOWN

em0: link state changed to DOWN

bce1: link state changed to UP

em3: link state changed to UP

em2: link state changed to UP

em1: link state changed to UP

em0: link state changed to UP

Terminated

Emulab control net is bce1

lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384

inet 127.0.0.1 netmask 0xff000000

inet6 ::1 prefixlen 128

inet6 fe80::1%lo0 prefixlen 64 scopeid 0x7

Generating nsswitch.conf.

Generating host.conf.

Additional routing options:.

hw.bus.devctl_disable: 0 -> 1

Mounting NFS file systems:.

ELF ldconfig path: /lib /usr/lib

a.out ldconfig path: /usr/lib/aout

Starting local daemons:Playing Frisbee ...

Authenticated IPOD enabled from 192.168.0.14/255.255.255.255

WARNING: kernel limits buffering to 1907 MB

da0: write-cache already on

Invalidating old potential superblocks: 63 6281415 22667715 31053645

Running /etc/testbed/frisbee -S 192.168.0.14 -M 1907 -i 192.168.1.1 -m 234.5.15.100 -p 7504 /dev/da0 at Wed Jan 27 09:50:49 MST 2010

Bound to port 7504

Using Multicast

Joined the team after 980 sec. ID is 1329151917. File is 1542 chunks (1579008 blocks)
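
In case it helps to compare nodes, here is a minimal Python sketch that pulls the frisbee parameters and the join delay out of console output like the lines above. The function name and the field names (server, mcast, join_sec, and so on) are my own labels for illustration and are not part of Emulab or frisbee. The regular expressions only assume the two line formats exactly as they appear in this console capture.

```python
import re

# Matches the client invocation line as printed on the console above.
# The group names are hypothetical labels, not official frisbee terminology.
RUN_RE = re.compile(
    r"frisbee -S (?P<server>\S+) -M (?P<mem_mb>\d+) -i (?P<iface_ip>\S+) "
    r"-m (?P<mcast>\S+) -p (?P<port>\d+) (?P<disk>\S+)"
)

# Matches the "Joined the team" line as printed on the console above.
JOIN_RE = re.compile(
    r"Joined the team after (?P<join_sec>\d+) sec\. ID is (?P<client_id>\d+)\. "
    r"File is (?P<chunks>\d+) chunks \((?P<blocks>\d+) blocks\)"
)

INT_FIELDS = ("mem_mb", "port", "join_sec", "chunks", "blocks")


def summarize_frisbee(console_text):
    """Collect frisbee fields found anywhere in a console capture."""
    summary = {}
    for line in console_text.splitlines():
        for pattern in (RUN_RE, JOIN_RE):
            match = pattern.search(line)
            if match:
                summary.update(match.groupdict())
    # Convert the numeric fields so they can be compared across nodes.
    for key in INT_FIELDS:
        if key in summary:
            summary[key] = int(summary[key])
    return summary


# The two relevant lines, copied from the console output above.
sample = (
    "Running /etc/testbed/frisbee -S 192.168.0.14 -M 1907 -i 192.168.1.1 "
    "-m 234.5.15.100 -p 7504 /dev/da0 at Wed Jan 27 09:50:49 MST 2010\n"
    "Joined the team after 980 sec. ID is 1329151917. "
    "File is 1542 chunks (1579008 blocks)\n"
)

summary = summarize_frisbee(sample)
for key, value in sorted(summary.items()):
    print(f"{key}: {value}")
```

Running this over the saved console log of each stuck node would give one small dictionary per node, which makes it easier to see whether they all use the same server, multicast address, and port, and how long each one waited before joining.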

I then tried to run an experiment on one of the nodes that was still up. The experiment failed, and that node is now stuck in reloading as well.

When I ran the one-node experiment, this is how it ended:

*** os_setup: Not waiting for pc23 since its reload/reboot failed!

TIMESTAMP: 12:33:22:441184 rebooting/reloading finished

TIMESTAMP: 12:33:22:441613 Local node waiting started

TIMESTAMP: 12:33:22:442004 Local node waiting finished

OS Setup Done.

*** ERROR: os_setup:

*** There were 1 failed nodes.

***

*** 1/1 nodes failed to load the os "CENTOS53-64-STD": A(pc23)

TIMESTAMP: 12:33:22:444017 os_setup finished

*** ERROR: tbswap: Failed to reset OS and reboot nodes.

Not retrying due to error type.

Cleaning up after errors.

Stopping the event system

-q -o BatchMode=yes -o StrictHostKeyChecking=no ops.emulab.naresky /usr/testbed/sbin/eventsys.proxy -d -l \

/proj/AFRLTestbed/exp/quicktest/logs/event-sched.log -k /proj/AFRLTestbed/exp/quicktest/tbdata/eventkey -g \

AFRLTestbed -e AFRLTestbed/quicktest -u emanager stop

Trying ssh protocol 2...

TIMESTAMP: 12:33:24:48155 snmpit started

Removing VLANs.

snmpit: AFRLTestbed/quicktest has no VLANs to remove, skipping

TIMESTAMP: 12:33:24:659918 snmpit finished

Tearing down virtual nodes.

TIMESTAMP: 12:33:24:661217 vnode_setup -k started

vnode_setup running at parallelization: 10 wait_time: 120

Vnode teardown finished.

TIMESTAMP: 12:33:26:206311 vnode_setup finished

Freeing nodes.

TIMESTAMP: 12:33:26:207621 nfree started

Releasing all nodes from experiment [Experiment: AFRLTestbed/quicktest].

Moving [Node: pc23] to [Experiment: emulab-ops/reloadpending]

-q -o BatchMode=yes -o StrictHostKeyChecking=no -n tips.emulab.naresky /usr/testbed/sbin/console_setup.proxy \

pc23 emulab-ops

Trying ssh protocol 2...

TIMESTAMP: 12:33:28:828800 nfree finished

Resetting named maps.

TIMESTAMP: 12:33:28:830408 named started

TIMESTAMP: 12:33:29:439340 named finished

Resetting email lists.

TIMESTAMP: 12:33:29:440691 genelists started

TIMESTAMP: 12:33:30:2964 genelists finished

Resetting DB.

Failingly finished swap-in for AFRLTestbed/quicktest. 12:33:30:5886

TIMESTAMP: 12:33:30:6258 tbswap in finished (failed)

*** ERROR: swapexp: tbswap in failed!

Cleaning up and exiting with status 1 ...

**** Experimental information, please ignore ****

Session ID = 141149

Likely Cause of the Problem:

There were 1 failed nodes.

1/1 nodes failed to load the os "CENTOS53-64-STD": A(pc23)

Cause: unknown

Confidence: 0.7

Script: os_setup

**** End experimental information ****

--------------------------------------------------------------------------------------------------------

Does anyone have any ideas?

Thank you,

Donna