
Re: [Testbed-admins] Nodes Stuck in reloading

To: "Mike Hibler" <mike@flux.utah.edu>
From: "Korrie, Donna M CTR USAF AFMC AFRL/RYRD" <Donna.Korrie.ctr@rl.af.mil>
Subject: Re: [Testbed-admins] Nodes Stuck in reloading
Date: Wed, 27 Jan 2010 14:45:42 -0500

Here is what I found on the control network switch for vlan3:

IGMP snooping is globally enabled
IGMP snooping is enabled on this interface
IGMP snooping fast-leave (for v2) is disabled and querier is disabled
IGMP snooping explicit-tracking is enabled
IGMP snooping last member query response interval is 1000 ms
IGMP snooping report-suppression is enabled



-----Original Message-----
From: Mike Hibler [mailto:mike@flux.utah.edu] 
Sent: Wednesday, January 27, 2010 1:54 PM
To: Korrie, Donna M CTR USAF AFMC AFRL/RYRD
Cc: testbed-admins@flux.utah.edu; Leigh Stoller
Subject: Re: [Testbed-admins] Nodes Stuck in reloading

It looks like frisbee is not starting up for you.

Let's try the easy thing first.  It is possible that the DB has leftover
state that prevents a frisbee server from starting.  First, see if you
have a frisbeed running on your boss for the image in question
(CENTOS53-64-STD or whatever).  If not, go into the web interface and go
to the info page for that image: Experimentation -> List ImageIDs ->
<select your image>.  It should show the "Load Address" and "Frisbee pid"
as blank.  If those are set, then the system thinks there is a server
running for that image when there really isn't.  You can edit the image
descriptor, clear the load address, and set the pid to 0.
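
As a rough sketch of the first check (whether a frisbeed is running on
boss for the image), something like the following could be used.  It
assumes that the server's command line contains both the string
"frisbeed" and the image name, which may not hold on every install; the
function names are made up for illustration.

```python
import subprocess


def frisbeed_lines(ps_output, image_name):
    """Return the ps lines that look like a frisbeed server for image_name."""
    matches = []
    for line in ps_output.splitlines():
        # Assumption: the server's command line mentions both "frisbeed"
        # and the image name (for example, as part of the image file path).
        if "frisbeed" in line and image_name in line:
            matches.append(line.strip())
    return matches


def running_frisbeeds(image_name):
    """Run ps with wide output and filter it for the given image."""
    result = subprocess.run(
        ["ps", "axww"], capture_output=True, text=True, check=True
    )
    return frisbeed_lines(result.stdout, image_name)
```

Calling running_frisbeeds("CENTOS53-64-STD") on boss returns an empty
list when nothing matches.  In that case the "Load Address" and
"Frisbee pid" fields in the web interface are the next thing to look at.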

If that is not the problem, then you probably have a multicast issue.
Did your control net switch lose power in the outage?  You may need to
go in and make sure IGMP snooping is enabled.
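
One way to test multicast delivery across the switch independently of
frisbee is a small probe pair: a receiver that joins a group on a node,
and a sender run from another machine on the control net.  The sketch
below uses only the standard socket API and does not speak the frisbee
protocol; the function names are illustrative.  The group, port, and
interface address are whatever you choose to pass in (the quoted console
log below shows 234.5.15.100, port 7504, and the address given with -i).

```python
import socket


def make_mreq(group, iface_ip):
    """Build the ip_mreq bytes: group address, then local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)


def wait_for_multicast(group, port, iface_ip, timeout=10.0):
    """Join group on iface_ip; return True if a datagram arrives in time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        # Joining the group makes the host send an IGMP membership report,
        # which is what a snooping switch listens for.
        sock.setsockopt(
            socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, make_mreq(group, iface_ip)
        )
        sock.settimeout(timeout)
        try:
            sock.recvfrom(65535)
            return True
        except socket.timeout:
            return False
    finally:
        sock.close()


def send_probe(group, port, iface_ip, payload=b"mcast-probe"):
    """Send one datagram to the group out of the interface with iface_ip."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(
            socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(iface_ip)
        )
        # The default multicast TTL of 1 keeps the probe on the local subnet.
        sock.sendto(payload, (group, port))
    finally:
        sock.close()
```

Start wait_for_multicast on a node first, then call send_probe from the
other machine with the same group and port.  If the receiver times out
while unicast traffic between the two hosts works, the switch's
multicast handling is a likely suspect, though not the only possible
one.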

On Wed, Jan 27, 2010 at 01:23:41PM -0500, Korrie, Donna M CTR USAF AFMC
AFRL/RYRD wrote:
> Several nodes are stuck in reloading.
> 
>  
> 
> I believe this started after a power outage.  But from what I can tell
> my systems were not affected; they are all on UPS and they were all on
> the next day.
> 
>  
> 
> This is the issue: my nodes are getting stuck in reloading mode and they
> are reporting either yellow (possibly down) or red (down).
> 
>  
> 
> I discovered this problem after I tried to power on a few systems that I
> turned off over the weekend to conserve power.  I put them into HWDOWN
> before I powered them off.  I powered them back on and tried to free
> them from HWDOWN, and they got stuck at reloading.
> 
>  
> 
> I consoled into the nodes while they were reloading, and this is how the
> console output ends:
> 
>  
> 
> nfs_diskless: no server
> 
> Trying to mount root from ufs:/dev/md0c
> 
> Pre-seeding PRNG: kickstart.
> 
> Loading configuration files.
> 
> Entropy harvesting: interrupts ethernet point_to_point kickstart.
> 
> Fast boot: skipping disk checks.
> 
>  
> 
> Setting hostname: .
> 
> Emulab looking for control net among:  bce0 em0 em1 em2 em3 bce1 ...
> 
> em3: link state changed to DOWN
> 
> em2: link state changed to DOWN
> 
> em1: link state changed to DOWN
> 
> em0: link state changed to DOWN
> 
> bce1: link state changed to UP
> 
> em3: link state changed to UP
> 
> em2: link state changed to UP
> 
> em1: link state changed to UP
> 
> em0: link state changed to UP
> 
> Terminated
> 
> Emulab control net is bce1
> 
> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
> 
>         inet 127.0.0.1 netmask 0xff000000
> 
>         inet6 ::1 prefixlen 128
> 
>         inet6 fe80::1%lo0 prefixlen 64 scopeid 0x7
> 
> Generating nsswitch.conf.
> 
> Generating host.conf.
> 
> Additional routing options:.
> 
> hw.bus.devctl_disable: 0 -> 1
> 
> Mounting NFS file systems:.
> 
> ELF ldconfig path: /lib /usr/lib
> 
> a.out ldconfig path: /usr/lib/aout
> 
> Starting local daemons:Playing Frisbee ...
> 
> Authenticated IPOD enabled from 192.168.0.14/255.255.255.255
> 
> WARNING: kernel limits buffering to 1907 MB
> 
> da0: write-cache already on
> 
> Invalidating old potential superblocks: 63 6281415 22667715 31053645
> 
> Running /etc/testbed/frisbee -S 192.168.0.14 -M 1907   -i 192.168.1.1 -m
> 234.5.15.100 -p 7504 /dev/da0 at Wed Jan 27 09:50:49 MST 2010
> 
> Bound to port 7504
> 
> Using Multicast
> 
> Joined the team after 980 sec. ID is 1329151917. File is 1542 chunks
> (1579008 blocks)
> 
>  
> 
> I then tried to run an experiment on one of the nodes that was still up,
> and the experiment failed and the system is stuck in reloading.
> 
>  
> 
> When I ran the one-node experiment, this is how it ended:
> 
>  
> 
> *** os_setup: Not waiting for pc23 since its reload/reboot failed!
> 
> TIMESTAMP: 12:33:22:441184 rebooting/reloading finished
> 
> TIMESTAMP: 12:33:22:441613 Local node waiting started
> 
> TIMESTAMP: 12:33:22:442004 Local node waiting finished
> 
> OS Setup Done. 
> 
> *** ERROR: os_setup:
> 
> ***   There were 1 failed nodes.
> 
> ***   
> 
> ***   1/1 nodes failed to load the os "CENTOS53-64-STD": A(pc23)
> 
> TIMESTAMP: 12:33:22:444017 os_setup finished
> 
> *** ERROR: tbswap: Failed to reset OS and reboot nodes.
> 
> Not retrying due to error type.
> 
> Cleaning up after errors.
> 
> Stopping the event system
> 
> -q -o BatchMode=yes -o StrictHostKeyChecking=no ops.emulab.naresky
> /usr/testbed/sbin/eventsys.proxy -d -l \
> 
>  /proj/AFRLTestbed/exp/quicktest/logs/event-sched.log -k
> /proj/AFRLTestbed/exp/quicktest/tbdata/eventkey -g \
> 
>  AFRLTestbed -e AFRLTestbed/quicktest -u emanager stop
> 
> Trying ssh protocol 2...
> 
> TIMESTAMP: 12:33:24:48155 snmpit started
> 
> Removing VLANs.
> 
> snmpit: AFRLTestbed/quicktest has no VLANs to remove, skipping
> 
> TIMESTAMP: 12:33:24:659918 snmpit finished
> 
> Tearing down virtual nodes.
> 
> TIMESTAMP: 12:33:24:661217 vnode_setup -k started
> 
> vnode_setup running at parallelization: 10 wait_time: 120
> 
> Vnode teardown finished.
> 
> TIMESTAMP: 12:33:26:206311 vnode_setup finished
> 
> Freeing nodes.
> 
> TIMESTAMP: 12:33:26:207621 nfree started
> 
> Releasing all nodes from experiment [Experiment: AFRLTestbed/quicktest].
> 
> Moving [Node: pc23] to [Experiment: emulab-ops/reloadpending]
> 
> -q -o BatchMode=yes -o StrictHostKeyChecking=no -n tips.emulab.naresky
> /usr/testbed/sbin/console_setup.proxy \
> 
>  pc23 emulab-ops
> 
> Trying ssh protocol 2...
> 
> TIMESTAMP: 12:33:28:828800 nfree finished
> 
> Resetting named maps.
> 
> TIMESTAMP: 12:33:28:830408 named started
> 
> TIMESTAMP: 12:33:29:439340 named finished
> 
> Resetting email lists.
> 
> TIMESTAMP: 12:33:29:440691 genelists started
> 
> TIMESTAMP: 12:33:30:2964 genelists finished
> 
> Resetting DB.
> 
> Failingly finished swap-in for AFRLTestbed/quicktest. 12:33:30:5886
> 
> TIMESTAMP: 12:33:30:6258 tbswap in finished (failed)
> 
> *** ERROR: swapexp: tbswap in failed!
> 
> Cleaning up and exiting with status 1 ... 
> 
> **** Experimental information, please ignore ****
> 
> Session ID = 141149
> 
> Likely Cause of the Problem:
> 
>   There were 1 failed nodes.
> 
>   
> 
>   1/1 nodes failed to load the os "CENTOS53-64-STD": A(pc23)
> 
> Cause: unknown
> 
> Confidence: 0.7
> 
> Script: os_setup
> 
> **** End experimental information ****
> 
> ------------------------------------------------------------------------
> 
>  
> 
>  
> 
> Have any ideas?
> 
>  
> 
> Thank you,
> 
> Donna
> 
>  
> 
>  
> 

> _______________________________________________
> Testbed-admins mailing list
> Testbed-admins@flux.utah.edu
> http://www.flux.utah.edu/mailman/listinfo/testbed-admins
