Re: [Testbed-admins] Nodes Stuck in reloading
Assuming that bce0 is the interface for the Emulab side and not your external
facing network, then this is correct.
To free a node from "reloading" (or "hwdown", or wherever it is stuck):
nfree emulab-ops reloading pcXXX
This will force it back through the reload process. Once the node reaches
the state below where it says it is "Using Multicast", do a ps on boss and
make sure a corresponding "frisbeed" is running and using the same MC address
and port. Look at /usr/testbed/log/frisbeed.log and see if it saw a JOIN
request from the node and do a tcpdump to see if there is any traffic coming
from or going to that node; e.g.
tcpdump -n -i bce0 host 192.168.0.14
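The ps and log checks above could be scripted roughly as follows. This is only a sketch: frisbeed_matches is a hypothetical helper (not part of Emulab), MCADDR and PORT are placeholders you fill in from the node's console output, and I am assuming the log mentions the word "join" in some form.

```shell
#!/bin/sh
# Hypothetical helper (not an Emulab tool): reads "ps" output on stdin and
# keeps only frisbeed lines that mention both the multicast address ($1)
# and the port ($2). The [f] keeps grep from matching its own ps entry.
frisbeed_matches() {
    grep '[f]risbeed' | grep -F -- "$1" | grep -F -- "$2"
}

# Placeholders: take these from the node's console output.
MCADDR="${MCADDR:-}"
PORT="${PORT:-}"

if [ -n "$MCADDR" ] && [ -n "$PORT" ]; then
    # Is there a frisbeed on boss using the same MC address and port?
    ps axww | frisbeed_matches "$MCADDR" "$PORT" \
        || echo "no frisbeed found for $MCADDR port $PORT"

    # Assumption: JOIN requests show up in the log with the word "join".
    grep -i 'join' /usr/testbed/log/frisbeed.log | tail -n 20
fi
```

Run it on boss with MCADDR and PORT set in the environment; with either one unset it does nothing. The port match is a plain substring match, so eyeball the output rather than trusting it blindly.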
Last time we had talked about making sure that the server was doing
"keep alives", but I don't know if we decided that was necessary. Anyway,
look at frisbeelauncher for:
#
# Force multicast keepalives if necessary
#
if ($ELABINELAB) {
$args .= " -K 15";
}
We were going to change that conditional to "(1 || $ELABINELAB)" just to
force it.
On Thu, Jan 28, 2010 at 01:22:37PM -0500, Korrie, Donna M CTR USAF AFMC AFRL/RYRD wrote:
> Mike,
>
> When I ran "netstat -ran" this line was there:
>
> 234.0.0/8 link#1 UCS 0 0 bce0
>
> I verified that the lines
> static_routes="frisbee"
> route_frisbee="-net 234.0.0.0/8 -interface bce0"
> are in the /etc/rc.conf file
>
> When I run /etc/rc.d/routing restart this is the results:
> route: writing to routing socket: File exists
> add net 234.0.0.0: gateway bce0: route already in table
> Additional routing options:.
>
> I am still getting the same results when I try to remove a system from
> reloading or reboot the system this is as far as it gets:
> Starting local daemons:Playing Frisbee ...
> Authenticated IPOD enabled from 192.168.0.14/255.255.255.255
> WARNING: kernel limits buffering to 1907 MB
> da0: write-cache already on
> Invalidating old potential superblocks: 63 6281415 22667715 31053645
> Running /etc/testbed/frisbee -S 192.168.0.14 -M 1907 -i 192.168.1.1 -m
> 234.5.0
> Bound to port 7511
> Using Multicast
>
>
> Any other thoughts?
>
>
> -----Original Message-----
> From: Mike Hibler [mailto:mike@flux.utah.edu]
> Sent: Wednesday, January 27, 2010 3:44 PM
> To: Korrie, Donna M CTR USAF AFMC AFRL/RYRD
> Cc: Mike Hibler; testbed-admins@flux.utah.edu; Leigh Stoller
> Subject: Re: [Testbed-admins] Nodes Stuck in reloading
>
> I am trying to recall for sure, but I think we have been here before...
> Ah, went looking back at old mail. Last time you had this issue it was
> because
> your boss didn't have a route for frisbeed's multicast traffic.
>
> Do a "netstat -ran" on your boss and see if there is a route for
> 234.0.0.0/8.
> What I said previously:
>
> > Add the following to your boss:/etc/rc.conf file:
> >
> > static_routes="frisbee"
> > route_frisbee="-net 234.0.0.0/8 -interface bce0"
> >
> > and then do:
> >
> > sudo /etc/rc.d/routing restart
>
> I am pretty sure we decided you didn't need a multicast router.
>
> On Wed, Jan 27, 2010 at 02:45:42PM -0500, Korrie, Donna M CTR USAF AFMC
> AFRL/RYRD wrote:
> > Here is what I found on the control network switch for vlan3
> >
> > IGMP snooping is globally enabled
> > IGMP snooping is enabled on this interface
> > IGMP snooping fast-leave (for v2) is disabled and querier is disabled
> > IGMP snooping explicit-tracking is enabled
> > IGMP snooping last member query response interval is 1000 ms
> > IGMP snooping report-suppression is enabled
> >
> >
> >
> > -----Original Message-----
> > From: Mike Hibler [mailto:mike@flux.utah.edu]
> > Sent: Wednesday, January 27, 2010 1:54 PM
> > To: Korrie, Donna M CTR USAF AFMC AFRL/RYRD
> > Cc: testbed-admins@flux.utah.edu; Leigh Stoller
> > Subject: Re: [Testbed-admins] Nodes Stuck in reloading
> >
> > It looks like frisbee is not starting up for you.
> > ...