Re: [Testbed-admins] Nodes Stuck in reloading



Assuming that bce0 is the interface for the Emulab side and not your external
facing network, then this is correct.
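
If you want to check this mechanically, here is a rough sketch that pulls
the interface name out of the routing table. It assumes the FreeBSD
"netstat -rn" layout you pasted, where the interface is the last column;
mcast_route_if is just a made-up helper name, not anything shipped with
Emulab.

```shell
# Sketch: report which interface carries the frisbee multicast route.
# Reads `netstat -rn`-style output on stdin and assumes the interface name
# is the last field of each route line (true for the output quoted below).
# mcast_route_if is a hypothetical helper, not part of Emulab.
mcast_route_if() {
    awk '$1 ~ /^234(\.0)*\/8$/ { print $NF; found = 1 }
         END { exit (found ? 0 : 1) }'
}
```

Run it as "netstat -rn | mcast_route_if". For the route line in your mail it
prints bce0; it returns non-zero if no 234/8 route is present.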

So free a node from "reloading" (or "hwdown", or wherever they are stuck):

   nfree emulab-ops reloading pcXXX

This will force it back through the reload process.  Once the node reaches
the state below where it says it is "Using Multicast", do a ps on boss and
make sure a corresponding "frisbeed" is running and using the same MC address
and port.  Look at /usr/testbed/log/frisbeed.log to see whether it saw a JOIN
request from the node, and run tcpdump to see whether any traffic is coming
from or going to that node; e.g.

   tcpdump -n -i bce0 host 192.168.0.14
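
For the ps check, something like the following sketch may save some
squinting. It just filters a process listing for frisbeed lines that mention
both the multicast address and the port as whole words. find_frisbeed is a
hypothetical helper name, and it deliberately matches plain strings rather
than assuming any particular frisbeed option names.

```shell
# Sketch: narrow a process listing (e.g. from `ps axww`) down to frisbeed
# lines that mention a given multicast address and port.
# find_frisbeed is a hypothetical helper, not part of Emulab.
# The [f] in the pattern keeps a grep process that shows up in the
# listing from matching itself.
find_frisbeed() {
    mcaddr=$1
    port=$2
    grep '[f]risbeed' | grep -F -w -- "$mcaddr" | grep -F -w -- "$port"
}
```

Usage would be "ps axww | find_frisbeed <mcaddr> 7511", with the address and
port taken from the node's console output. No output (and a non-zero exit)
means no frisbeed matches. The port match is only a whole-word string match,
so a PID that happens to equal the port number could give a false hit.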

Last time we talked about making sure that the server was sending
"keepalives", but I don't know whether we decided that was necessary.  Anyway,
look in frisbeelauncher for:

#
# Force multicast keepalives if necessary
#
if ($ELABINELAB) {
    $args .= " -K 15";
}

We were going to change that conditional to "(1 || $ELABINELAB)" just to
force it.
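
To confirm the change took effect after the next reload, a sketch along
these lines lists any frisbeed that was started without the keepalive
option. It assumes the $args string above ends up on frisbeed's command
line, which I have not double-checked; frisbeed_missing_keepalive is a
made-up name.

```shell
# Sketch: print frisbeed lines from a process listing that lack " -K ".
# Assumes the launcher's $args appear verbatim on frisbeed's command line.
# frisbeed_missing_keepalive is a hypothetical helper, not part of Emulab.
frisbeed_missing_keepalive() {
    grep '[f]risbeed' | grep -v -e ' -K '
}
```

Run "ps axww | frisbeed_missing_keepalive"; no output means every running
frisbeed has a -K argument.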

On Thu, Jan 28, 2010 at 01:22:37PM -0500, Korrie, Donna M CTR USAF AFMC AFRL/RYRD wrote:
> Mike,
> 
> When I ran "netstat -ran" this line was there:
> 
> 234.0.0/8          link#1             UCS         0        0   bce0
> 
> I verified that the lines
> static_routes="frisbee"
> route_frisbee="-net 234.0.0.0/8 -interface bce0"
> are in the /etc/rc.conf file
> 
> When I run /etc/rc.d/routing restart these are the results:
> route: writing to routing socket: File exists
> add net 234.0.0.0: gateway bce0: route already in table
> Additional routing options:.
> 
> I am still getting the same results. When I try to remove a system from
> reloading, or reboot the system, this is as far as it gets:
> Starting local daemons:Playing Frisbee ...
> Authenticated IPOD enabled from 192.168.0.14/255.255.255.255
> WARNING: kernel limits buffering to 1907 MB
> da0: write-cache already on
> Invalidating old potential superblocks: 63 6281415 22667715 31053645
> Running /etc/testbed/frisbee -S 192.168.0.14 -M 1907   -i 192.168.1.1 -m
> 234.5.0
> Bound to port 7511
> Using Multicast
> 
> 
> Any other thoughts?
> 
> 
> -----Original Message-----
> From: Mike Hibler [mailto:mike@flux.utah.edu] 
> Sent: Wednesday, January 27, 2010 3:44 PM
> To: Korrie, Donna M CTR USAF AFMC AFRL/RYRD
> Cc: Mike Hibler; testbed-admins@flux.utah.edu; Leigh Stoller
> Subject: Re: [Testbed-admins] Nodes Stuck in reloading
> 
> I am trying to recall for sure, but I think we have been here before...
> Ah, went looking back at old mail.  Last time you had this issue it was
> because your boss didn't have a route for frisbeed's multicast traffic.
> 
> Do a "netstat -ran" on your boss and see if there is a route for 234.0.0.0/8.
> What I said previously:
> 
> > Add the following to your boss:/etc/rc.conf file:
> > 
> >   static_routes="frisbee"
> >   route_frisbee="-net 234.0.0.0/8 -interface bce0"
> > 
> > and then do:
> > 
> >   sudo /etc/rc.d/routing restart
> 
> I am pretty sure we decided you didn't need a multicast router.
> 
> On Wed, Jan 27, 2010 at 02:45:42PM -0500, Korrie, Donna M CTR USAF AFMC AFRL/RYRD wrote:
> > Here is what I found on the control network switch for vlan3
> > 
> > IGMP snooping is globally enabled
> > IGMP snooping is enabled on this interface
> > IGMP snooping fast-leave (for v2) is disabled and querier is disabled
> > IGMP snooping explicit-tracking is enabled
> > IGMP snooping last member query response interval is 1000 ms
> > IGMP snooping report-suppression is enabled
> > 
> > 
> > 
> > -----Original Message-----
> > From: Mike Hibler [mailto:mike@flux.utah.edu] 
> > Sent: Wednesday, January 27, 2010 1:54 PM
> > To: Korrie, Donna M CTR USAF AFMC AFRL/RYRD
> > Cc: testbed-admins@flux.utah.edu; Leigh Stoller
> > Subject: Re: [Testbed-admins] Nodes Stuck in reloading
> > 
> > It looks like frisbee is not starting up for you.
> > ...