
Re: [Testbed-admins] Can not bring up the first node for a new Emulab site





On Thu, Aug 6, 2009 at 11:01 AM, Mike Hibler <mike@flux.utah.edu> wrote:
So you have a couple of things going on.  One, it appears there must be
a DHCP server running on whatever network de0 is attached to.  The whole
DHCP-to-find-the-control-net technique is fraught with peril and only
works reliably if there is a single attached network with a DHCP server on
it.  You need to look into that,

I think you are right. de0 was attached to the experiment switch, which is connected to the control switch via a cable. I think de0 could also hear the DHCP server running on boss through the cable connecting the two switches, so I have removed that cable for testing for now.

but it is not the problem that is stopping
you here, because the determination of who is "boss" is different and
independent of the control net.

By default, boss is determined by who the node's DNS server is.  This is
apparently set correctly because the node was able to query boss and get
the "IPOD" info (I assume that 192.168.56.3 is your boss node).
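To illustrate the idea of "boss is whoever the node's DNS server is", here is a minimal sketch that picks the first `nameserver` entry out of resolv.conf-style text. It assumes the node keeps its resolver configuration in the usual resolv.conf format; the function name `first_nameserver` is made up for this example and is not how the Emulab client code is actually written.

```python
def first_nameserver(resolv_conf_text):
    """Return the address on the first 'nameserver' line, or None if there is none."""
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments ('#' or ';' in resolv.conf).
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            return fields[1]
    return None


if __name__ == "__main__":
    # On a node you would read the real file instead of this sample text.
    sample = "search example.net\nnameserver 192.168.56.3\n"
    print(first_nameserver(sample))
```

If the address printed on the node is not the boss address, that would be worth checking before anything else.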

The "unable to get address" is caused by boss not returning the correct
"loadinfo".  The first thing to try is just to force the reload-related DB
state of the node to get reset by doing:

 nfree emulab-ops reloading pcXXX

This did allow the node to load the disk image successfully!

Then after the image was loaded it rebooted to,
-------------------------------------------------
Type a key for interactive mode (quick, quick!)
Entering boot wait mode. Type ^C for interactive mode ...
-------------------------------------------------
At this point I think everything works. The website shows there is one free PC. Then I went ahead and created my first experiment (the simplest one-node experiment),
and it gave me the following error:
-------------------------------------------------
Running 'tbswap in  testbed test'
Beginning swap-in for testbed/test (19). 08/06/2009 15:27:03
TIMESTAMP: 15:27:03:891649 tbswap in started
Checking with Admission Control ...
Mapping to physical reality ...
TIMESTAMP: 15:27:03:914402 mapper wrapper started
Starting the new and improved mapper wrapper.
Clearing physical state before updating.
Minimum nodes = 1
Maximum nodes = 1
Assign run 1
ptopargs: '-p testbed -e test '
assign command: 'assign -P testbed-test-79659.ptop testbed-test-79659.vtop'
Reading assign results.
Successfully reserved all physical nodes we needed.
TIMESTAMP: 15:27:07:886102 mapper wrapper finished
Mapped to physical reality!
Fetching tarballs and RPMs (if any) ...
TIMESTAMP: 15:27:07:894153 tarfiles_setup started
TIMESTAMP: 15:27:08:440888 tarfiles_setup finished
Setting up mountpoints.
TIMESTAMP: 15:27:08:449751 mountpoints started
TIMESTAMP: 15:27:11:941438 mountpoints finished
TIMESTAMP: 15:27:11:946914 named started
Setting up named maps.
TIMESTAMP: 15:27:12:548057 named finished
TIMESTAMP: 15:27:12:553921 gentopofile started
Generating ltmap (again) ...
TIMESTAMP: 15:27:13:169962 gentopofile finished
Resetting OS and rebooting.
TIMESTAMP: 15:27:13:176674 launching os_setup
Setting up VLANs.
TIMESTAMP: 15:27:13:207860 snmpit started
snmpit: testbed/test has no VLANs to create, skipping
TIMESTAMP: 15:27:14:242266 snmpit finished
Setting up email lists.
TIMESTAMP: 15:27:14:246705 genelists started
TIMESTAMP: 15:27:14:355425 os_setup started
Mapping [OS 10076: emulab-ops,RHL-STD] on pc1 to [OS 10077: emulab-ops,FC6-STD].
TIMESTAMP: 15:27:14:872836 rebooting/reloading nodes started
TIMESTAMP: 15:27:15:236015 genelists finished
Clearing port counters.
TIMESTAMP: 15:27:15:240001 portstats started
reboot (pc1): Attempting to reboot ...
reboot (pc1): Successful!
reboot: Done. There were 0 failures.
TIMESTAMP: 15:27:15:788376 portstats finished
reboot (pc1): child returned 0 status.
TIMESTAMP: 15:27:16:895773 rebooting/reloading finished
Waiting for local testbed nodes to finish rebooting ...
TIMESTAMP: 15:27:16:899892 Local node waiting started
*** WARNING: os_setup: pc1 reported a TBFAILED event; not retrying
TIMESTAMP: 15:27:54:280751 Local node waiting finished
OS Setup Done.
*** ERROR: os_setup:
*** There were 1 failed nodes.
***
*** 1/1 pc932's with a system osid of "FC6-STD" failed to boot: node0(pc1)
TIMESTAMP: 15:27:54:332529 os_setup finished
*** ERROR: tbswap: Failed to reset OS and reboot nodes.
Cleaning up after errors; will try again.
Stopping the event system
TIMESTAMP: 15:27:55:883059 snmpit started
Removing VLANs.
snmpit: testbed/test has no VLANs to remove, skipping
TIMESTAMP: 15:27:56:488187 snmpit finished
Freeing failed nodes.
TIMESTAMP: 15:27:56:497186 nfree started
Moving [Node: pc1] to [Experiment: emulab-ops/hwdown]
TIMESTAMP: 15:27:57:465283 nfree finished
Trying again...
Mapping to physical reality ...
TIMESTAMP: 15:27:57:478082 mapper wrapper started
Starting the new and improved mapper wrapper.
Clearing physical state before updating.
Minimum nodes = 1
Maximum nodes = 1
Assign run 1
ptopargs: '-p testbed -e test '
assign command: 'assign -P testbed-test-79717.ptop testbed-test-79717.vtop'
*** ERROR: mapper: Unretriable error. Giving up.
*** ERROR: tbswap: Failed (65) to map to reality.
Cleaning up after errors.
Stopping the event system
TIMESTAMP: 15:28:02:136708 snmpit started
Removing VLANs.
snmpit: testbed/test has no VLANs to remove, skipping
TIMESTAMP: 15:28:02:754285 snmpit finished
Tearing down virtual nodes.
TIMESTAMP: 15:28:02:760556 vnode_setup -k started
*** WARNING: vnode_setup: No allocated nodes in experiment testbed/test.
TIMESTAMP: 15:28:03:374169 vnode_setup finished
Freeing nodes.
TIMESTAMP: 15:28:03:381223 nfree started
Releasing all nodes from experiment [Experiment: testbed/test].
TIMESTAMP: 15:28:03:847820 nfree finished
Resetting DB.
Failingly finished swap-in for testbed/test. 15:28:03:864696
TIMESTAMP: 15:28:03:866206 tbswap in finished (failed)
*** ERROR: swapexp: tbswap in failed!
Cleaning up and exiting with status 1 ...
**** Experimental information, please ignore ****
Session ID = 33047
Likely Cause of the Problem:
There were 1 failed nodes.

1/1 pc932's with a system osid of "FC6-STD" failed to boot: node0(pc1)
Cause: unknown
Confidence: 0.7
Script: os_setup
**** End experimental information ****
----------------------------------------------------------------------
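When reading a long swap-in log like the one above, it can help to pull out only the lines tagged `*** ERROR` or `*** WARNING`. The following is a minimal sketch that relies only on the `*** TAG: script: message` convention visible in this log; `extract_problems` is a hypothetical helper name, not an Emulab tool.

```python
import re

# Lines such as "*** ERROR: tbswap: Failed to reset OS and reboot nodes."
TAGGED = re.compile(r"^\*\*\* (?P<tag>ERROR|WARNING): (?P<script>[^:]+):\s*(?P<msg>.*)$")


def extract_problems(log_text):
    """Return a list of (tag, script, message) tuples for tagged log lines."""
    problems = []
    for line in log_text.splitlines():
        m = TAGGED.match(line.strip())
        if m:
            problems.append((m.group("tag"), m.group("script"), m.group("msg")))
    return problems


if __name__ == "__main__":
    import sys
    # Usage (file name is up to you): python extract_problems.py swapin.log
    with open(sys.argv[1]) as f:
        for tag, script, msg in extract_problems(f.read()):
            print(f"{tag:8s} {script:10s} {msg}")
```

On the log above this would surface the os_setup warning about the TBFAILED event first, which is the earliest sign of trouble; the later mapper and tbswap errors follow from the node having been moved to emulab-ops/hwdown.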

At this time, the node had booted into the default OS (FC6-STD); it showed:
-----------------------------------------------------------------
Fedora Core release 6 (Zod)
Kernel 2.6.20.6emulab on an i686

node0 login:

-------------------------------------------------------------------
I checked the database and pc1 had been moved into emulab-ops/hwdown at this point.  Any idea why the experiment swap-in failed?


Thanks,
Evan Zhang


 

(where pcXXX is your node name) This will force the node back through the
reloading process.

If that doesn't change anything, look at /usr/testbed/log/tmcd.log on boss
and see what it says.  After an attempted boot of this node you should see
something like:

 Aug  6 08:49:02 boss tmcd[71339]: pc236: vers:30 TCP loadinfo

and then if it was working:

 Aug  6 08:49:02 boss tmcd[71339]: pc236: loadinfo wrote 105 bytes

If it claims to be returning the info, then you should be able to login
to the node on the console as root (you cannot ssh in) and run:

 /etc/testbed/tmcc loadinfo

and see what it returns.
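The tmcd.log check described above can be scripted. This is a minimal sketch that assumes the two line shapes quoted above (a `loadinfo` request line, then a `loadinfo wrote N bytes` reply line); the name `loadinfo_status` is invented for illustration and is not part of Emulab.

```python
import re

# Patterns inferred from the two sample tmcd.log lines quoted above.
REQUEST = re.compile(r"tmcd\[\d+\]: (?P<node>\S+): vers:\d+ \S+ loadinfo\s*$")
REPLY = re.compile(r"tmcd\[\d+\]: (?P<node>\S+): loadinfo wrote (?P<n>\d+) bytes")


def loadinfo_status(lines, node):
    """Return (request_count, reply_sizes) for the given node name."""
    requests = 0
    reply_sizes = []
    for line in lines:
        m = REQUEST.search(line)
        if m and m.group("node") == node:
            requests += 1
            continue
        m = REPLY.search(line)
        if m and m.group("node") == node:
            reply_sizes.append(int(m.group("n")))
    return requests, reply_sizes


if __name__ == "__main__":
    import sys
    # Usage on boss: python loadinfo_status.py pcXXX < /usr/testbed/log/tmcd.log
    reqs, sizes = loadinfo_status(sys.stdin, sys.argv[1])
    print(f"{reqs} loadinfo request(s), reply sizes: {sizes}")
```

Requests with no matching replies would suggest boss is not returning loadinfo for that node.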

On Thu, Aug 06, 2009 at 10:31:49AM -0400, chunhui Zhang(Evan) wrote:
> On Wed, Aug 5, 2009 at 8:03 PM, Mike Hibler <mike@flux.utah.edu> wrote:
>
> > Any chance you are getting a DHCP reply over de0?
> >
>
> This time I disconnected de0, and it showed:
> ---------------------------------------------
> Setting hostname: .
> Emulab looking for control net among: de0 xl0 ...
> xl0: link state changed to UP
> Terminated
> *Emulab control net is xl0*
> ...
> ...
> ...
> Starting local daemons: Playing Frisbee ...
> *de0: autosense failed: cable problem?*
> Authenticated IPOD enabled from 192.168.56.3/255.255.255.255
> Unable to get address for loading image
> Failed to load disk, dropping to login prompt at Wed Aug  6 08:11:43 MDT 2009
> --------------------------------------------------------
>
> >
> > There should be some logfiles from the DHCP process in /var/tmp.
> > What is in /var/tmp/netif-emulab.log?
>
>
> I did 'cat /var/tmp/netif-emulab.log' and it showed,
> --------------------------------------------------------
> *Using dhclient port...*
> --------------------------------------------------------
>
> Thanks for the help,
> Evan Zhang
>
> >
> >
> > On Wed, Aug 05, 2009 at 07:52:21PM -0400, chunhui Zhang(Evan) wrote:
> > > I got to the point that,
> > > ----------------------------------------
> > > Attempting boot of: /tftpboot/frisbee
> > > Loading /boot/defaults/loader.conf
> > > /boot/kernel text= ............................................etc
> > > /boot/acpi.ko text= ..........................................etc
> > >
> > > Hit [Enter] to boot immediately, or any other key for command prompt.
> > > Booting [/boot/kernel]...
> > > ...
> > > ...
> > > ...
> > > Setting hostname: .
> > > Emulab looking for control net among: de0 xl0 ...
> > > xl0: link state changed to UP
> > > Emulab control net is de0
> > > ...
> > > ...
> > > ...
> > > Starting local daemons: Playing Frisbee ...
> > > Authenticated IPOD enabled from 192.168.56.3/255.255.255.255
> > > Unable to get address for loading image
> > > Failed to load disk, dropping to login prompt at Wed Aug  5 17:36:44 MDT 2009
> > > ...
> > > ...
> > > ...
> > > FreeBSD/i386 (pc1) (console)
> > >
> > > login:
> > >
> > > ----------------------------------------
> > >
> > > The message above says the control net is de0, which is not true. xl0 is the
> > > one I use as the control NIC, and the DB knows that as well. My node
> > > was booted from xl0, and de0 has no PXE boot ability at all.  Then I
> > > logged in at the prompt it dropped to and ran 'cat /etc/testbed/controlif';
> > > it showed 'xl0'. I traced through the source code and found that 'tmcc
> > > loadinfo' returned nothing. I am not sure how to debug this issue further,
> > > so I am wondering: have you seen a similar issue before? Any suggestions?
> > >
> > > Thanks a lot,
> > > Evan Zhang
> > >
> > >
> > > On Tue, Aug 4, 2009 at 10:42 PM, chunhui Zhang(Evan) <chyz198@gmail.com> wrote:
> > >
> > > >
> > > >
> > > > On Tue, Aug 4, 2009 at 10:26 PM, Mike Hibler <mike@flux.utah.edu> wrote:
> > > >
> > > >> On Tue, Aug 04, 2009 at 09:40:45PM -0400, chunhui Zhang(Evan) wrote:
> > > >> > On Mon, Aug 3, 2009 at 6:05 PM, Mike Hibler <mike@flux.utah.edu> wrote:
> > > >> >
> > > >> > > I will jump in the middle here, please excuse me if you have already
> > > >> > > answered any of my questions!
> > > >> > >
> > > >> > > In one of your earlier messages you had some text that "came out", was
> > > >> > > that on the VGA or serial line?  It doesn't matter, as long as we have a
> > > >> > > console that works.
> > > >> > >
> > > >> >
> > > >> > >
> > > >> > > Do I understand correctly that the 47 MFS boots on the PIII node but the
> > > >> > > 62 version does not?  Strange...
> > > >> > >
> > > >> > > No matter, let's go back to the 62 MFS, the 47 is a loser long term.
> > > >> > > Verify that you have the right console selected in
> > > >> > > /tftpboot/freebsd.newnode/boot/loader.conf.orig
> > > >> >
> > > >> >
> > > >> > Changing the option 'console="comconsole"' to 'console="vidconsole"' in
> > > >> > loader.conf.orig solved my problem!
> > > >> > Thank you so much!  BTW, where can I find the documentation that covers
> > > >> > this configuration?
> > > >> >
> > > >> >
> > > >> > Evan Zhang
> > > >> >
> > > >>
> > > >> How much of your problem did it solve?  Will the MFS boot on the newer
> > > >> node?
> > > >>
> > > >
> > > >  Yes, all three MFSs can boot now.
> > > >
> > > >>
> > > >> Re: documentation, I thought that console config was in the README for the
> > > >> MFS but it isn't!  We need to fix that.
> > > >
> > > >
> > > > Evan
> > > >
> > > >
> > > >
> >
> > > _______________________________________________
> > > Testbed-admins mailing list
> > > Testbed-admins@flux.utah.edu
> > > http://www.flux.utah.edu/mailman/listinfo/testbed-admins
> >
> >