
Re: [Testbed-admins] Nodes Stuck in reloading



***I do not see a line that corresponds with multicast:

[root@boss:~](1:51pm)#ps
  PID  TT  STAT      TIME COMMAND
 1502  d0- S      0:53.97 /usr/bin/perl -wT /usr/testbed/sbin/reload_daemon (perl5.8.8)
 1578  d0  Is+    0:00.00 /usr/libexec/getty std.115200 console
 1579  v0  Is+    0:00.00 /usr/libexec/getty Pc ttyv0
 1580  v1  Is+    0:00.00 /usr/libexec/getty Pc ttyv1
 1581  v2  Is+    0:00.00 /usr/libexec/getty Pc ttyv2
 1582  v3  Is+    0:00.00 /usr/libexec/getty Pc ttyv3
 1583  v4  Is+    0:00.00 /usr/libexec/getty Pc ttyv4
 1584  v5  Is+    0:00.00 /usr/libexec/getty Pc ttyv5
 1585  v6  Is+    0:00.00 /usr/libexec/getty Pc ttyv6
 1586  v7  Is+    0:00.00 /usr/libexec/getty Pc ttyv7
27302  p0  Is     0:00.01 -csh (csh)
28133  p0  I+     0:00.03 ssh tips
28290  p1  Ss     0:00.01 -csh (csh)
28423  p1  R+     0:00.00 ps

***Below is the last entry in frisbeed.log
Jan 28 13:38:54 boss frisbeed[28099]: 192.168.1.1 (id 1037960745, image /usr/testbed/images/FBSD63+FC8-STD.ndz) joins at 13:38:54!  2 active clients.
Jan 28 13:53:54 boss frisbeed[28097]: No requests for 1800 seconds!
Jan 28 13:53:54 boss frisbeed[28097]: Params:
Jan 28 13:53:54 boss frisbeed[28097]:   chunk/block size    1024/1024
Jan 28 13:53:54 boss frisbeed[28097]:   burst size/interval 8/1000
Jan 28 13:53:54 boss frisbeed[28097]:   file read size      32
Jan 28 13:53:54 boss frisbeed[28097]:   file:size           /usr/testbed/images/FBSD63+FC8-STD.ndz:1616904192
Jan 28 13:53:54 boss frisbeed[28097]: Stats:
Jan 28 13:53:54 boss frisbeed[28097]:   service time:      901.052 sec
Jan 28 13:53:54 boss frisbeed[28097]:   user/sys CPU time: 0.000/3.604
Jan 28 13:53:54 boss frisbeed[28097]:   msgs in/out:       4/2050
Jan 28 13:53:54 boss frisbeed[28097]:   joins/leaves:      2/0
Jan 28 13:53:54 boss frisbeed[28097]:   requests:          2 (0 merged in queue)
Jan 28 13:53:54 boss frisbeed[28097]:   partial req/blks:  0/0
Jan 28 13:53:54 boss frisbeed[28097]:   duplicate req:     0
Jan 28 13:53:54 boss frisbeed[28097]:   client re-req:     0
Jan 28 13:53:54 boss frisbeed[28097]:   1k blocks sent:    2048 (-1576960 repeated)
Jan 28 13:53:54 boss frisbeed[28097]:   file reads:        64 (2097152 bytes, 18446744072094744576 repeated)
Jan 28 13:53:54 boss frisbeed[28097]:   net idle/blocked:  2/0
Jan 28 13:53:54 boss frisbeed[28097]:   send intvl/missed: 256/96
Jan 28 13:53:54 boss frisbeed[28097]:   spurious wakeups:  0
Jan 28 13:53:54 boss frisbeed[28097]:   max workq size:    2
Jan 28 13:53:54 boss frisbeed[28097]: Exiting!
Jan 28 13:53:54 boss frisbeed[28433]: Opened /usr/testbed/images/FBSD63+FC8-STD.ndz: 1579008 blocks
Jan 28 13:53:54 boss frisbeed[28433]: Maximum send bandwidth 71.552 Mbits/sec (8000 blocks/sec)
Jan 28 13:53:54 boss frisbeed[28433]: Bound to port 7511
Jan 28 13:53:54 boss frisbeed[28433]: Using Multicast
Jan 28 13:53:55 boss frisbeed[28435]: 192.168.1.3 (id 2082084551, image
/usr/testbed/images/FBSD63
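
The negative and very large "repeated" counters in the stats above look like arithmetic wraparound rather than real counts. A minimal sketch of that arithmetic, assuming "repeated" is computed as the amount sent minus the image size; that formula is inferred from the logged numbers only, not taken from the frisbeed source, and the function names are made up for illustration:

```python
# Assumption: "repeated" = amount sent - image size. This is inferred from
# the values in the log excerpt, not from the frisbeed source code.

IMAGE_BLOCKS = 1579008    # "Opened ...ndz: 1579008 blocks"
IMAGE_BYTES = 1616904192  # "file:size ...ndz:1616904192"
BLOCKS_SENT = 2048        # "1k blocks sent: 2048"
BYTES_READ = 2097152      # "file reads: 64 (2097152 bytes, ...)"


def repeated_signed(sent: int, total: int) -> int:
    """The difference as a signed integer would print it."""
    return sent - total


def repeated_unsigned64(sent: int, total: int) -> int:
    """The same difference when printed as an unsigned 64-bit value."""
    return (sent - total) % 2**64


blocks_repeated = repeated_signed(BLOCKS_SENT, IMAGE_BLOCKS)
bytes_repeated = repeated_unsigned64(BYTES_READ, IMAGE_BYTES)
print(blocks_repeated)
print(bytes_repeated)
```

Both results match the logged values, which is consistent with only 2048 of the 1579008 blocks having been sent when that frisbeed exited.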

***Tcpdump
[root@boss:/usr/testbed/log](2:03pm)#tcpdump -n -i bce0 host 192.168.0.14
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bce0, link-type EN10MB (Ethernet), capture size 96 bytes
14:03:48.364620 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107
14:03:53.468160 IP 192.168.0.14 > 224.0.0.2: igmp leave 234.5.15.107
14:03:53.468170 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107
14:03:54.631385 IP 192.168.0.16.669 > 192.168.0.14.855: S 3882426019:3882426019(0) win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 1757050326 0,sackOK,eol>
14:03:54.631404 IP 192.168.0.14.855 > 192.168.0.16.669: S 3680384703:3680384703(0) ack 3882426020 win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 169790095 1757050326,sackOK,eol>
14:03:54.631514 IP 192.168.0.16.669 > 192.168.0.14.855: . ack 1 win 33304 <nop,nop,timestamp 1757050326 169790095>
14:03:54.631542 IP 192.168.0.16.669 > 192.168.0.14.855: P 1:329(328) ack 1 win 33304 <nop,nop,timestamp 1757050326 169790095>
14:03:54.631752 IP 192.168.0.14.855 > 192.168.0.16.669: F 1:1(0) ack 329 win 33304 <nop,nop,timestamp 169790096 1757050326>
14:03:54.631845 IP 192.168.0.16.669 > 192.168.0.14.855: . ack 2 win 33304 <nop,nop,timestamp 1757050326 169790096>
14:03:54.631879 IP 192.168.0.16.669 > 192.168.0.14.855: F 329:329(0) ack 2 win 33304 <nop,nop,timestamp 1757050326 169790096>
14:03:54.631894 IP 192.168.0.14.855 > 192.168.0.16.669: . ack 330 win 33303 <nop,nop,timestamp 169790096 1757050326>
14:04:01.702899 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107
14:04:08.428703 IP 192.168.0.14 > 224.0.0.2: igmp leave 234.5.15.107
14:04:08.428713 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107
14:04:10.064075 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107
14:04:13.298847 IP 192.168.0.16.666 > 192.168.0.14.855: S 1571107716:1571107716(0) win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 1757069091 0,sackOK,eol>
14:04:13.298883 IP 192.168.0.14.855 > 192.168.0.16.666: S 971395781:971395781(0) ack 1571107717 win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 169808849 1757069091,sackOK,eol>
14:04:13.298975 IP 192.168.0.16.666 > 192.168.0.14.855: . ack 1 win 33304 <nop,nop,timestamp 1757069091 169808849>
14:04:13.299007 IP 192.168.0.16.666 > 192.168.0.14.855: P 1:329(328) ack 1 win 33304 <nop,nop,timestamp 1757069091 169808849>
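
To make the IGMP activity in a capture like this easier to scan, the report and leave packets can be pulled out with a short script. This is only a sketch: it assumes each packet occupies a single line in the form tcpdump prints with -n, and the names IGMP_RE, FRISBEE_NET and igmp_events are made up for illustration. The 234.0.0.0/8 network is the frisbee route from /etc/rc.conf quoted further down in this thread.

```python
import ipaddress
import re

# Matches the two IGMP line shapes seen in the capture ("v2 report", "leave").
IGMP_RE = re.compile(
    r"^(?P<time>\d\d:\d\d:\d\d\.\d+) IP (?P<src>[\d.]+) > (?P<dst>[\d.]+): "
    r"igmp (?P<kind>v2 report|leave) (?P<group>[\d.]+)$"
)

# The static route configured on boss for frisbee multicast traffic.
FRISBEE_NET = ipaddress.ip_network("234.0.0.0/8")


def igmp_events(lines):
    """Return (time, source, kind, group) for every IGMP report or leave line."""
    events = []
    for line in lines:
        m = IGMP_RE.match(line.strip())
        if m:
            events.append((m["time"], m["src"], m["kind"], m["group"]))
    return events


sample = [
    "14:03:48.364620 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107",
    "14:03:53.468160 IP 192.168.0.14 > 224.0.0.2: igmp leave 234.5.15.107",
    "14:03:53.468170 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107",
    "14:03:54.631514 IP 192.168.0.16.669 > 192.168.0.14.855: . ack 1 win 33304",
]

events = igmp_events(sample)
for time, src, kind, group in events:
    covered = ipaddress.ip_address(group) in FRISBEE_NET
    print(time, src, kind, group, "covered by frisbee route:", covered)
```

In the sample, every IGMP packet comes from 192.168.0.14 and names a group inside 234.0.0.0/8; the TCP line is ignored.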

***I was looking at the frisbeelauncher log, does not look good:
[root@boss:/usr/testbed/log](2:10pm)#vi /usr/testbed/log/frisbeelauncher
21971: Killed, cleaning up
mysqld went away in process 21971. 0 tries left:
  Query: UPDATE emulab_indicies SET idx=LAST_INSERT_ID(idx+1) WHERE name = 'cur_log_seq'
  Error: Server shutdown in progress (1053)
Use of uninitialized value in concatenation (.) or string at /usr/testbed/lib/libdb.pm line 337.
*** WARNING: frisbeelauncher: Could not log entry to DB:
*** WARNING: frisbeelauncher: dblog failed: Could not log entry to DB:
Mysql connect('database=tbdb;host=localhost','frisbeelauncher:root:21971',...) failed: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2) at /usr/testbed/lib/emdbi.pm line 151
Cannot connect to DB; trying again in 5 seconds!
26184: Killed, cleaning up
27717: Killed, cleaning up
63072: Killed, cleaning up
66995: Killed, cleaning up
Mysql connect('database=tbdb;host=localhost','frisbeelauncher:root:66995',...) failed: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2) at /usr/testbed/lib/emdbi.pm line 151
Cannot connect to DB; trying again in 5 seconds!
6623: Killed, cleaning up
Mysql connect('database=tbdb;host=localhost','frisbeelauncher:root:6623',...) failed: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2) at /usr/testbed/lib/emdbi.pm line 151
Cannot connect to DB; trying again in 5 seconds!
Mysql connect('database=tbdb;host=localhost','frisbeelauncher:root:6623',...) failed: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2) at /usr/testbed/lib/emdbi.pm line 151
Cannot connect to DB; trying again in 5 seconds!


***You had us comment that out

#
# Force multicast keepalives if necessary
#
# ***removed the conditional as per Utah's instruction***
# - Josh 08/19/2009
#if ($ELABINELAB) {
    $args .= " -K 15";
#}

I changed that to
if (1 || $ELABINELAB) {
    $args .= " -K 15";
}
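
As far as the logic goes, the commented-out conditional and the "(1 || $ELABINELAB)" form should behave the same: both always append " -K 15". A small Python model of the three Perl variants illustrates this; the function names and the placeholder argument string are hypothetical, and this only models the conditional, not the rest of frisbeelauncher.

```python
def args_original(args: str, elabinelab: bool) -> str:
    # Perl: if ($ELABINELAB) { $args .= " -K 15"; }
    if elabinelab:
        args += " -K 15"
    return args


def args_commented_out(args: str, elabinelab: bool) -> str:
    # Perl: the "if" and closing brace are commented out,
    # so the append runs unconditionally.
    args += " -K 15"
    return args


def args_forced(args: str, elabinelab: bool) -> str:
    # Perl: if (1 || $ELABINELAB) { $args .= " -K 15"; }
    if True or elabinelab:
        args += " -K 15"
    return args


base = "<other frisbeed args>"  # placeholder, not real frisbeed options
for flag in (False, True):
    print(flag,
          repr(args_original(base, flag)),
          repr(args_commented_out(base, flag)),
          repr(args_forced(base, flag)))
```

Only the original form depends on the flag; the other two produce the same argument string either way.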


Do I need to restart anything?


From: Mike Hibler [mailto:mike@flux.utah.edu] 
Sent: Thursday, January 28, 2010 1:47 PM
To: Korrie, Donna M CTR USAF AFMC AFRL/RYRD
Cc: Mike Hibler; testbed-admins@flux.utah.edu; Leigh Stoller
Subject: Re: [Testbed-admins] Nodes Stuck in reloading

Assuming that bce0 is the interface for the Emulab side and not your
external-facing network, then this is correct.

So free a node from "reloading" (or "hwdown" or wherever they are stuck):

   nfree emulab-ops reloading pcXXX

This will force it back through the reload process.  Once the node reaches
the state below where it says it is "Using Multicast", do a ps on boss and
make sure a corresponding "frisbeed" is running and using the same MC address
and port.  Look at /usr/testbed/log/frisbeed.log and see if it saw a JOIN
request from the node and do a tcpdump to see if there is any traffic coming
from or going to that node; e.g.

   tcpdump -n -i bce0 host 192.168.0.14
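
The frisbeed.log check described above can also be scripted. A minimal sketch, assuming JOIN entries have the shape of the "joins at" line quoted from frisbeed.log earlier in this thread and that each entry is on one line; JOIN_RE and joins_for are names made up for illustration:

```python
import re

# Shape assumed from the "joins at" entry quoted from frisbeed.log.
JOIN_RE = re.compile(
    r"frisbeed\[(?P<pid>\d+)\]: (?P<client>[\d.]+) "
    r"\(id (?P<id>\d+), image (?P<image>\S+)\) "
    r"joins at (?P<time>\d\d:\d\d:\d\d)!\s+(?P<active>\d+) active clients"
)


def joins_for(log_text: str, client_ip: str):
    """Return (time, image, active_clients) for each JOIN logged for client_ip."""
    found = []
    for line in log_text.splitlines():
        m = JOIN_RE.search(line)
        if m and m["client"] == client_ip:
            found.append((m["time"], m["image"], int(m["active"])))
    return found


sample_log = (
    "Jan 28 13:38:54 boss frisbeed[28099]: 192.168.1.1 (id 1037960745, image "
    "/usr/testbed/images/FBSD63+FC8-STD.ndz) joins at 13:38:54!  2 active clients.\n"
    "Jan 28 13:53:54 boss frisbeed[28097]: No requests for 1800 seconds!\n"
)

joins = joins_for(sample_log, "192.168.1.1")
print(joins)
```

To use it against the real file, read /usr/testbed/log/frisbeed.log into a string and pass the node's address; an empty list would mean no JOIN from that node was logged.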

Last time we had talked about making sure that the server was doing
"keep alives", but I don't know if we decided that was necessary.  Anyway,
look at frisbeelauncher for:

#
# Force multicast keepalives if necessary
#
if ($ELABINELAB) {
    $args .= " -K 15";
}

We were going to change that conditional to "(1 || $ELABINELAB)" just to
force it.

On Thu, Jan 28, 2010 at 01:22:37PM -0500, Korrie, Donna M CTR USAF AFMC
AFRL/RYRD wrote:
> Mike,
> 
> When I ran "netstat -ran" this line was there:
> 
> 234.0.0/8          link#1             UCS         0        0   bce0
> 
> I verified that the lines
> static_routes="frisbee"
> route_frisbee="-net 234.0.0.0/8 -interface bce0"
> are in the /etc/rc.conf file
> 
> When I run /etc/rc.d/routing restart these are the results:
> route: writing to routing socket: File exists
> add net 234.0.0.0: gateway bce0: route already in table
> Additional routing options:.
> 
> I am still getting the same results. When I try to remove a system from
> reloading or reboot the system, this is as far as it gets:
> Starting local daemons:Playing Frisbee ...
> Authenticated IPOD enabled from 192.168.0.14/255.255.255.255
> WARNING: kernel limits buffering to 1907 MB
> da0: write-cache already on
> Invalidating old potential superblocks: 63 6281415 22667715 31053645
> Running /etc/testbed/frisbee -S 192.168.0.14 -M 1907   -i 192.168.1.1 -m 234.5.0
> Bound to port 7511
> Using Multicast
> 
> 
> Any other thoughts?
> 
> 
> -----Original Message-----
> From: Mike Hibler [mailto:mike@flux.utah.edu] 
> Sent: Wednesday, January 27, 2010 3:44 PM
> To: Korrie, Donna M CTR USAF AFMC AFRL/RYRD
> Cc: Mike Hibler; testbed-admins@flux.utah.edu; Leigh Stoller
> Subject: Re: [Testbed-admins] Nodes Stuck in reloading
> 
> I am trying to recall for sure, but I think we have been here before...
> Ah, went looking back at old mail.  Last time you had this issue it was
> because your boss didn't have a route for frisbeed's multicast traffic.
> 
> Do a "netstat -ran" on your boss and see if there is a route for
> 234.0.0.0/8.
> What I said previously:
> 
> > Add the following to your boss:/etc/rc.conf file:
> > 
> >   static_routes="frisbee"
> >   route_frisbee="-net 234.0.0.0/8 -interface bce0"
> > 
> > and then do:
> > 
> >   sudo /etc/rc.d/routing restart
> 
> I am pretty sure we decided you didn't need a multicast router.
> 
> On Wed, Jan 27, 2010 at 02:45:42PM -0500, Korrie, Donna M CTR USAF AFMC
> AFRL/RYRD wrote:
> > Here is what I found on the control network switch for vlan3
> > 
> > IGMP snooping is globally enabled
> > IGMP snooping is enabled on this interface
> > IGMP snooping fast-leave (for v2) is disabled and querier is disabled
> > IGMP snooping explicit-tracking is enabled
> > IGMP snooping last member query response interval is 1000 ms
> > IGMP snooping report-suppression is enabled
> > 
> > 
> > 
> > -----Original Message-----
> > From: Mike Hibler [mailto:mike@flux.utah.edu] 
> > Sent: Wednesday, January 27, 2010 1:54 PM
> > To: Korrie, Donna M CTR USAF AFMC AFRL/RYRD
> > Cc: testbed-admins@flux.utah.edu; Leigh Stoller
> > Subject: Re: [Testbed-admins] Nodes Stuck in reloading
> > 
> > It looks like frisbee is not starting up for you.
> > ...