Re: [Testbed-admins] Nodes Stuck in reloading
***I do not see a line that corresponds with multicast:
[root@boss:~](1:51pm)#ps
PID TT STAT TIME COMMAND
1502 d0- S 0:53.97 /usr/bin/perl -wT /usr/testbed/sbin/reload_daemon (perl5.8.8)
1578 d0 Is+ 0:00.00 /usr/libexec/getty std.115200 console
1579 v0 Is+ 0:00.00 /usr/libexec/getty Pc ttyv0
1580 v1 Is+ 0:00.00 /usr/libexec/getty Pc ttyv1
1581 v2 Is+ 0:00.00 /usr/libexec/getty Pc ttyv2
1582 v3 Is+ 0:00.00 /usr/libexec/getty Pc ttyv3
1583 v4 Is+ 0:00.00 /usr/libexec/getty Pc ttyv4
1584 v5 Is+ 0:00.00 /usr/libexec/getty Pc ttyv5
1585 v6 Is+ 0:00.00 /usr/libexec/getty Pc ttyv6
1586 v7 Is+ 0:00.00 /usr/libexec/getty Pc ttyv7
27302 p0 Is 0:00.01 -csh (csh)
28133 p0 I+ 0:00.03 ssh tips
28290 p1 Ss 0:00.01 -csh (csh)
28423 p1 R+ 0:00.00 ps
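A note on reading this listing: ps run with no options typically shows only your own processes that have a controlling terminal, so a daemon such as frisbeed can be running and still not appear here. A minimal sketch of a wider check, assuming FreeBSD's ps on boss; the helper name find_frisbeed is made up for illustration:

```shell
#!/bin/sh
# find_frisbeed: hypothetical helper, not part of Emulab.
# Reads a process listing on stdin and prints only the frisbeed lines.
# The [f] bracket keeps a grep command line from matching itself.
find_frisbeed() {
    grep '[f]risbeed'
}

# On boss one might run (a = all users, x = include processes without a
# controlling terminal, ww = do not truncate the command column):
#   ps axww | find_frisbeed
```

The multicast address and port, if present on the frisbeed command line, can then be compared with what the node console reports.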
***Below is the last entry in frisbeed.log
Jan 28 13:38:54 boss frisbeed[28099]: 192.168.1.1 (id 1037960745, image /usr/testbed/images/FBSD63+FC8-STD.ndz) joins at 13:38:54! 2 active clients.
Jan 28 13:53:54 boss frisbeed[28097]: No requests for 1800 seconds!
Jan 28 13:53:54 boss frisbeed[28097]: Params:
Jan 28 13:53:54 boss frisbeed[28097]: chunk/block size 1024/1024
Jan 28 13:53:54 boss frisbeed[28097]: burst size/interval 8/1000
Jan 28 13:53:54 boss frisbeed[28097]: file read size 32
Jan 28 13:53:54 boss frisbeed[28097]: file:size /usr/testbed/images/FBSD63+FC8-STD.ndz:1616904192
Jan 28 13:53:54 boss frisbeed[28097]: Stats:
Jan 28 13:53:54 boss frisbeed[28097]: service time: 901.052 sec
Jan 28 13:53:54 boss frisbeed[28097]: user/sys CPU time: 0.000/3.604
Jan 28 13:53:54 boss frisbeed[28097]: msgs in/out: 4/2050
Jan 28 13:53:54 boss frisbeed[28097]: joins/leaves: 2/0
Jan 28 13:53:54 boss frisbeed[28097]: requests: 2 (0 merged in queue)
Jan 28 13:53:54 boss frisbeed[28097]: partial req/blks: 0/0
Jan 28 13:53:54 boss frisbeed[28097]: duplicate req: 0
Jan 28 13:53:54 boss frisbeed[28097]: client re-req: 0
Jan 28 13:53:54 boss frisbeed[28097]: 1k blocks sent: 2048 (-1576960 repeated)
Jan 28 13:53:54 boss frisbeed[28097]: file reads: 64 (2097152 bytes, 18446744072094744576 repeated)
Jan 28 13:53:54 boss frisbeed[28097]: net idle/blocked: 2/0
Jan 28 13:53:54 boss frisbeed[28097]: send intvl/missed: 256/96
Jan 28 13:53:54 boss frisbeed[28097]: spurious wakeups: 0
Jan 28 13:53:54 boss frisbeed[28097]: max workq size: 2
Jan 28 13:53:54 boss frisbeed[28097]: Exiting!
Jan 28 13:53:54 boss frisbeed[28433]: Opened /usr/testbed/images/FBSD63+FC8-STD.ndz: 1579008 blocks
Jan 28 13:53:54 boss frisbeed[28433]: Maximum send bandwidth 71.552 Mbits/sec (8000 blocks/sec)
Jan 28 13:53:54 boss frisbeed[28433]: Bound to port 7511
Jan 28 13:53:54 boss frisbeed[28433]: Using Multicast
Jan 28 13:53:55 boss frisbeed[28435]: 192.168.1.3 (id 2082084551, image
/usr/testbed/images/FBSD63
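The negative and very large "repeated" figures in the stats above look alarming but appear to be plain subtraction: each one matches the amount sent minus the size of the image, with the second printed as an unsigned 64-bit value (2^64 - 1614807040 = 18446744072094744576). This is an inference from the numbers in the log, not from the frisbeed source. A quick check in shell arithmetic:

```shell
#!/bin/sh
# Values copied from the frisbeed log above.
blocks_sent=2048          # "1k blocks sent"
image_blocks=1579008      # "Opened ...: 1579008 blocks"
bytes_read=2097152        # "file reads: 64 (2097152 bytes"
image_bytes=1616904192    # "file:size"

# Assumption: "repeated" is computed as amount sent minus image size.
blocks_repeated=$((blocks_sent - image_blocks))
bytes_repeated=$((bytes_read - image_bytes))

echo "blocks repeated: $blocks_repeated"   # -1576960, as logged
echo "bytes repeated:  $bytes_repeated"    # -1614807040 when read as signed
```

So the counters are consistent with only 2048 of the image's 1579008 blocks having been sent before the server timed out.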
***Tcpdump
[root@boss:/usr/testbed/log](2:03pm)#tcpdump -n -i bce0 host 192.168.0.14
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bce0, link-type EN10MB (Ethernet), capture size 96 bytes
14:03:48.364620 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107
14:03:53.468160 IP 192.168.0.14 > 224.0.0.2: igmp leave 234.5.15.107
14:03:53.468170 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107
14:03:54.631385 IP 192.168.0.16.669 > 192.168.0.14.855: S 3882426019:3882426019(0) win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 1757050326 0,sackOK,eol>
14:03:54.631404 IP 192.168.0.14.855 > 192.168.0.16.669: S 3680384703:3680384703(0) ack 3882426020 win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 169790095 1757050326,sackOK,eol>
14:03:54.631514 IP 192.168.0.16.669 > 192.168.0.14.855: . ack 1 win 33304 <nop,nop,timestamp 1757050326 169790095>
14:03:54.631542 IP 192.168.0.16.669 > 192.168.0.14.855: P 1:329(328) ack 1 win 33304 <nop,nop,timestamp 1757050326 169790095>
14:03:54.631752 IP 192.168.0.14.855 > 192.168.0.16.669: F 1:1(0) ack 329 win 33304 <nop,nop,timestamp 169790096 1757050326>
14:03:54.631845 IP 192.168.0.16.669 > 192.168.0.14.855: . ack 2 win 33304 <nop,nop,timestamp 1757050326 169790096>
14:03:54.631879 IP 192.168.0.16.669 > 192.168.0.14.855: F 329:329(0) ack 2 win 33304 <nop,nop,timestamp 1757050326 169790096>
14:03:54.631894 IP 192.168.0.14.855 > 192.168.0.16.669: . ack 330 win 33303 <nop,nop,timestamp 169790096 1757050326>
14:04:01.702899 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107
14:04:08.428703 IP 192.168.0.14 > 224.0.0.2: igmp leave 234.5.15.107
14:04:08.428713 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107
14:04:10.064075 IP 192.168.0.14 > 234.5.15.107: igmp v2 report 234.5.15.107
14:04:13.298847 IP 192.168.0.16.666 > 192.168.0.14.855: S 1571107716:1571107716(0) win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 1757069091 0,sackOK,eol>
14:04:13.298883 IP 192.168.0.14.855 > 192.168.0.16.666: S 971395781:971395781(0) ack 1571107717 win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 169808849 1757069091,sackOK,eol>
14:04:13.298975 IP 192.168.0.16.666 > 192.168.0.14.855: . ack 1 win 33304 <nop,nop,timestamp 1757069091 169808849>
14:04:13.299007 IP 192.168.0.16.666 > 192.168.0.14.855: P 1:329(328) ack 1 win 33304 <nop,nop,timestamp 1757069091 169808849>
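To make a capture like this easier to read, the lines can be tallied by type. A rough sketch, assuming `tcpdump -n` text on stdin with one packet per line in the format shown above; summarize_mcast is a hypothetical helper, and its "data" count simply means any packet addressed to a port on the group address:

```shell
#!/bin/sh
# summarize_mcast: hypothetical helper, not part of Emulab or tcpdump.
# $1 = multicast group address; reads tcpdump -n output on stdin.
# Counts IGMP reports, IGMP leaves, and packets sent to <group>.<port>.
summarize_mcast() {
    awk -v grp="$1" '
        /igmp v2 report/        { reports++; next }
        /igmp leave/            { leaves++;  next }
        index($0, "> " grp ".") { data++ }
        END { printf "reports=%d leaves=%d data=%d\n", reports, leaves, data }
    '
}

# Possible use on boss (-c stops the capture after 200 packets):
#   tcpdump -n -c 200 -i bce0 host 192.168.0.14 | summarize_mcast 234.5.15.107
```

For the excerpt above this comes to 5 reports, 2 leaves, and no data packets to the group; the rest is TCP between 192.168.0.16 and port 855 on 192.168.0.14.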
***I was looking at the frisbeelauncher log, does not look good:
[root@boss:/usr/testbed/log](2:10pm)#vi /usr/testbed/log/frisbeelauncher
21971: Killed, cleaning up
mysqld went away in process 21971. 0 tries left:
Query: UPDATE emulab_indicies SET idx=LAST_INSERT_ID(idx+1) WHERE name = 'cur_log_seq'
Error: Server shutdown in progress (1053)
Use of uninitialized value in concatenation (.) or string at /usr/testbed/lib/libdb.pm line 337.
*** WARNING: frisbeelauncher: Could not log entry to DB:
*** WARNING: frisbeelauncher: dblog failed: Could not log entry to DB:
Mysql connect('database=tbdb;host=localhost','frisbeelauncher:root:21971',...) failed: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2) at /usr/testbed/lib/emdbi.pm line 151
Cannot connect to DB; trying again in 5 seconds!
26184: Killed, cleaning up
27717: Killed, cleaning up
63072: Killed, cleaning up
66995: Killed, cleaning up
Mysql connect('database=tbdb;host=localhost','frisbeelauncher:root:66995',...) failed: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2) at /usr/testbed/lib/emdbi.pm line 151
Cannot connect to DB; trying again in 5 seconds!
6623: Killed, cleaning up
Mysql connect('database=tbdb;host=localhost','frisbeelauncher:root:6623',...) failed: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2) at /usr/testbed/lib/emdbi.pm line 151
Cannot connect to DB; trying again in 5 seconds!
Mysql connect('database=tbdb;host=localhost','frisbeelauncher:root:6623',...) failed: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2) at /usr/testbed/lib/emdbi.pm line 151
Cannot connect to DB; trying again in 5 seconds!
***You had us comment that out
#
# Force multicast keepalives if necessary
#
# ***removed the conditional as per Utah's instruction***
# - Josh 08/19/2009
#if ($ELABINELAB) {
$args .= " -K 15";
#}
I changed that to
if (1 || $ELABINELAB) {
$args .= " -K 15";
}
Do I need to restart anything?
From: Mike Hibler [mailto:mike@flux.utah.edu]
Sent: Thursday, January 28, 2010 1:47 PM
To: Korrie, Donna M CTR USAF AFMC AFRL/RYRD
Cc: Mike Hibler; testbed-admins@flux.utah.edu; Leigh Stoller
Subject: Re: [Testbed-admins] Nodes Stuck in reloading
Assuming that bce0 is the interface for the Emulab side and not your
external facing network, then this is correct.
So free a node from "reloading" (or "hwdown", or wherever they are
stuck):
nfree emulab-ops reloading pcXXX
This will force it back through the reload process. Once the node reaches
the state below where it says it is "Using Multicast", do a ps on boss and
make sure a corresponding "frisbeed" is running and using the same MC
address and port. Look at /usr/testbed/log/frisbeed.log and see if it saw
a JOIN request from the node, and do a tcpdump to see if there is any
traffic coming from or going to that node; e.g.
tcpdump -n -i bce0 host 192.168.0.14
Last time we had talked about making sure that the server was doing
"keep alives", but I don't know if we decided that was necessary. Anyway,
look at frisbeelauncher for:
#
# Force multicast keepalives if necessary
#
if ($ELABINELAB) {
$args .= " -K 15";
}
We were going to change that conditional to "(1 || $ELABINELAB)" just to
force it.
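The log check described above can be scripted. A minimal sketch, assuming frisbeed.log entries have the "IP (id N, image PATH) joins at TIME" shape seen elsewhere in this thread, one entry per line; count_joins is a hypothetical helper, and the node's address has to be filled in by hand:

```shell
#!/bin/sh
# count_joins: hypothetical helper, not part of Emulab.
# $1 = node IP address, $2 = path to a frisbeed log file.
# Prints how many join messages the log holds for that node.
count_joins() {
    # -F treats the dots in the address literally; "|| true" keeps the
    # exit status at 0 when grep -c prints a count of 0.
    grep -F -- "]: $1 (id " "$2" | grep -c 'joins at' || true
}

# Possible use on boss after freeing the node with nfree:
#   count_joins 192.168.1.1 /usr/testbed/log/frisbeed.log
```

A count that does not go up after the node prints "Using Multicast" would suggest its JOIN never reached frisbeed.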
On Thu, Jan 28, 2010 at 01:22:37PM -0500, Korrie, Donna M CTR USAF AFMC
AFRL/RYRD wrote:
> Mike,
>
> When I ran "netstat -ran" this line was there:
>
> 234.0.0/8 link#1 UCS 0 0 bce0
>
> I verified that the lines
> static_routes="frisbee"
> route_frisbee="-net 234.0.0.0/8 -interface bce0"
> are in the /etc/rc.conf file
>
> When I run /etc/rc.d/routing restart these are the results:
> route: writing to routing socket: File exists
> add net 234.0.0.0: gateway bce0: route already in table
> Additional routing options:.
>
> I am still getting the same results. When I try to remove a system from
> reloading or reboot the system, this is as far as it gets:
> Starting local daemons:Playing Frisbee ...
> Authenticated IPOD enabled from 192.168.0.14/255.255.255.255
> WARNING: kernel limits buffering to 1907 MB
> da0: write-cache already on
> Invalidating old potential superblocks: 63 6281415 22667715 31053645
> Running /etc/testbed/frisbee -S 192.168.0.14 -M 1907 -i 192.168.1.1 -m 234.5.0
> Bound to port 7511
> Using Multicast
>
>
> Any other thoughts?
>
>
> -----Original Message-----
> From: Mike Hibler [mailto:mike@flux.utah.edu]
> Sent: Wednesday, January 27, 2010 3:44 PM
> To: Korrie, Donna M CTR USAF AFMC AFRL/RYRD
> Cc: Mike Hibler; testbed-admins@flux.utah.edu; Leigh Stoller
> Subject: Re: [Testbed-admins] Nodes Stuck in reloading
>
> I am trying to recall for sure, but I think we have been here before...
> Ah, went looking back at old mail. Last time you had this issue it was
> because your boss didn't have a route for frisbeed's multicast traffic.
>
> Do a "netstat -ran" on your boss and see if there is a route for
> 234.0.0.0/8.
> What I said previously:
>
> > Add the following to your boss:/etc/rc.conf file:
> >
> > static_routes="frisbee"
> > route_frisbee="-net 234.0.0.0/8 -interface bce0"
> >
> > and then do:
> >
> > sudo /etc/rc.d/routing restart
>
> I am pretty sure we decided you didn't need a multicast router.
>
> On Wed, Jan 27, 2010 at 02:45:42PM -0500, Korrie, Donna M CTR USAF AFMC
> AFRL/RYRD wrote:
> > Here is what I found on the control network switch for vlan3
> >
> > IGMP snooping is globally enabled
> > IGMP snooping is enabled on this interface
> > IGMP snooping fast-leave (for v2) is disabled and querier is disabled
> > IGMP snooping explicit-tracking is enabled
> > IGMP snooping last member query response interval is 1000 ms
> > IGMP snooping report-suppression is enabled
> >
> >
> >
> > -----Original Message-----
> > From: Mike Hibler [mailto:mike@flux.utah.edu]
> > Sent: Wednesday, January 27, 2010 1:54 PM
> > To: Korrie, Donna M CTR USAF AFMC AFRL/RYRD
> > Cc: testbed-admins@flux.utah.edu; Leigh Stoller
> > Subject: Re: [Testbed-admins] Nodes Stuck in reloading
> >
> > It looks like frisbee is not starting up for you.
> > ...