Several nodes are stuck in reloading. I believe this started after a power outage,
but from what I can tell my systems were not affected: they are all on UPS, and
they were all still powered on the next day.

The issue is that my nodes get stuck in reloading mode and report either yellow
(possibly down) or red (down). I discovered the problem when I tried to power on
a few systems that I had turned off over the weekend to conserve power. I put them
into HWDOWN before I powered them off. When I powered them back on and tried
to free them from HWDOWN, they got stuck at reloading. I connected to the consoles
of the nodes while they were reloading, and this
is how they end:

nfs_diskless: no server
Trying to mount root from ufs:/dev/md0c
Pre-seeding PRNG: kickstart.
Loading configuration files.
Entropy harvesting: interrupts ethernet point_to_point kickstart.
Fast boot: skipping disk checks.
Setting hostname: .
Emulab looking for control net among: bce0 em0 em1 em2 em3 bce1 ...
em3: link state changed to DOWN
em2: link state changed to DOWN
em1: link state changed to DOWN
em0: link state changed to DOWN
bce1: link state changed to UP
em3: link state changed to UP
em2: link state changed to UP
em1: link state changed to UP
em0: link state changed to UP
Terminated
Emulab control net is bce1
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x7
Generating nsswitch.conf.
Generating host.conf.
Additional routing options:.
hw.bus.devctl_disable: 0 -> 1
Mounting NFS file systems:.
ELF ldconfig path: /lib /usr/lib
a.out ldconfig path: /usr/lib/aout
Starting local daemons:Playing Frisbee ...
Authenticated IPOD enabled from 192.168.0.14/255.255.255.255
WARNING: kernel limits buffering to 1907 MB
da0: write-cache already on
Invalidating old potential superblocks: 63 6281415 22667715 31053645
Running /etc/testbed/frisbee -S 192.168.0.14 -M 1907 -i 192.168.1.1 -m 234.5.15.100 -p 7504 /dev/da0 at Wed Jan 27 09:50:49 MST 2010
Bound to port 7504
Using Multicast
Joined the team after 980 sec. ID is 1329151917. File is 1542 chunks (1579008 blocks)

I then tried to run an experiment on one of the nodes that
were still up. The experiment failed, and that node is now stuck in reloading as
well. When I ran the one-node experiment, this is how it
ended.

*** os_setup: Not waiting for pc23 since its reload/reboot failed!
TIMESTAMP: 12:33:22:441184 rebooting/reloading finished
TIMESTAMP: 12:33:22:441613 Local node waiting started
TIMESTAMP: 12:33:22:442004 Local node waiting finished
OS Setup Done.
*** ERROR: os_setup:
*** There were 1 failed nodes.
***
*** 1/1 nodes failed to load the os "CENTOS53-64-STD": A(pc23)
TIMESTAMP: 12:33:22:444017 os_setup finished
*** ERROR: tbswap: Failed to reset OS and reboot nodes.
Not retrying due to error type.
Cleaning up after errors.
Stopping the event system
-q -o BatchMode=yes -o StrictHostKeyChecking=no ops.emulab.naresky /usr/testbed/sbin/eventsys.proxy -d -l \
    /proj/AFRLTestbed/exp/quicktest/logs/event-sched.log -k /proj/AFRLTestbed/exp/quicktest/tbdata/eventkey -g \
    AFRLTestbed -e AFRLTestbed/quicktest -u emanager stop
Trying ssh protocol 2...
TIMESTAMP: 12:33:24:48155 snmpit started
Removing VLANs.
snmpit: AFRLTestbed/quicktest has no VLANs to remove, skipping
TIMESTAMP: 12:33:24:659918 snmpit finished
Tearing down virtual nodes.
TIMESTAMP: 12:33:24:661217 vnode_setup -k started
vnode_setup running at parallelization: 10 wait_time: 120
Vnode teardown finished.
TIMESTAMP: 12:33:26:206311 vnode_setup finished
Freeing nodes.
TIMESTAMP: 12:33:26:207621 nfree started
Releasing all nodes from experiment [Experiment: AFRLTestbed/quicktest].
Moving [Node: pc23] to [Experiment: emulab-ops/reloadpending]
-q -o BatchMode=yes -o StrictHostKeyChecking=no -n tips.emulab.naresky /usr/testbed/sbin/console_setup.proxy \
    pc23 emulab-ops
Trying ssh protocol 2...
TIMESTAMP: 12:33:28:828800 nfree finished
Resetting named maps.
TIMESTAMP: 12:33:28:830408 named started
TIMESTAMP: 12:33:29:439340 named finished
Resetting email lists.
TIMESTAMP: 12:33:29:440691 genelists started
TIMESTAMP: 12:33:30:2964 genelists finished
Resetting DB.
Failingly finished swap-in for AFRLTestbed/quicktest. 12:33:30:5886
TIMESTAMP: 12:33:30:6258 tbswap in finished (failed)
*** ERROR: swapexp: tbswap in failed!
Cleaning up and exiting with status 1 ...

**** Experimental information, please ignore ****
Session ID = 141149
Likely Cause of the Problem:
There were 1 failed nodes.
1/1 nodes failed to load the os "CENTOS53-64-STD": A(pc23)
Cause: unknown
Confidence: 0.7
Script: os_setup
**** End experimental information ****
--------------------------------------------------------------------------------------------------------

Have any ideas?

Thank you,
Donna