Re: [Testbed-admins] Can't bring up the first node



Bringing this back "online" to testbed-admins...

As you recall, the problem was that the MFS BSD kernel was booting but
only finding 640K of physical memory and thus panicking.  Turns out the
kernel generally trusts what the boot loader tells it, so it was the
loader (pxeboot) that was getting and passing bad intel.
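
For anyone who wants to sanity-check what the loader handed the kernel, here
is a minimal sketch (not part of pxeboot or the Emulab tools) that sums the
usable regions from the "SMAP type=.. base=.. len=.." lines a verbose boot
("boot -v") prints, and sizes a "Physical memory chunk(s)" range.  It assumes
the exact line format quoted later in this thread and the usual BIOS E820
convention that type 1 means usable RAM; the function and variable names are
hypothetical.

```python
import re

# Assumed format, as in the verbose boot output quoted in this thread:
#   SMAP type=01 base=0000000000000000 len=000000000008d000
SMAP_RE = re.compile(
    r"SMAP type=([0-9a-fA-F]+) base=([0-9a-fA-F]+) len=([0-9a-fA-F]+)")

SMAP_USABLE = 0x01    # E820/SMAP type 1 = usable RAM (type 2 = reserved)
PAGE_SIZE = 4096      # i386 page size used in the "(N pages)" counts

def usable_bytes(log_text):
    """Sum the lengths of all type-1 (usable) SMAP regions in a boot log."""
    total = 0
    for m in SMAP_RE.finditer(log_text):
        smap_type, _base, length = (int(g, 16) for g in m.groups())
        if smap_type == SMAP_USABLE:
            total += length
    return total

def chunk_size(start, end, page_size=PAGE_SIZE):
    """Bytes and whole pages in an inclusive 'start - end' memory chunk."""
    nbytes = end - start + 1
    return nbytes, nbytes // page_size

# Sample taken from the expected output quoted in this thread.
SAMPLE_LOG = """
SMAP type=01 base=0000000000000000 len=000000000008d000
SMAP type=02 base=00000000000f0000 len=0000000000010000
SMAP type=01 base=0000000000100000 len=000000007f4ffc00
"""

if __name__ == "__main__":
    print(usable_bytes(SAMPLE_LOG))       # 2136525824 (usable bytes in the sample)
    print(chunk_size(0x1000, 0x5bfff))    # (372736, 91), the failing boot's only chunk
    print(655360 == 640 * 1024)           # True: "Real memory = 655360" is exactly 640K
```

If no SMAP lines show up at all, usable_bytes() returns 0, which fits the
suspicion in this thread that only the old 640K base-memory figure was being
passed along.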

Well, I ultimately got to the point where I sent Derek a version of pxeboot
with some debugging printfs in it to figure out what was going on and,
of course, it started working just fine at that point, just like the last
time I went after this bug.

At the same time, I had fixed one minor memory overwrite (which should only
have been clobbering extra buffer space) and rearranged some code, so I
took the debugging printfs back out to see what would happen, and it still
works.  So I suspect all I really did was move the bug somewhere less
critical.  Since we will be moving to an all-new Linux-based bootstrap soon
anyway, I am not feeling too guilty about this.

Anyway, the new "improved" versions of pxeboot are now out at:

	http://www.emulab.net/downloads/pxeboot62a.tar.gz

No need to upgrade unless you experience problems.

On Thu, Aug 13, 2009 at 04:45:01PM -0400, Espinola, Derek wrote:
> Yes, I did put the kernel and acpi.ko module in /tftpboot..... And reran prepare. And it pretty much blew up on loading.
> 
> I tried the 6.4 kernel with the change to loader.conf.orig. In verbose mode I did not see messages like:
> 
>   SMAP type=01 base=0000000000000000 len=000000000008d000
> 
> It basically went through the initial loading and then a bunch of MADT found ....CPU id .... messages.
> Then the CPU info, then:
> 
> Real memory = 655360 (0 MB)
> Physical memory chunks (s):
> 0x0000000000001000 - 0x000000000005bfff, 372736 bytes (91 pages)
> Avail memory = 114699 (0 MB)
> bios32 : Found BIOS32 Service Directory header at 0xc00ffe80
> bios32: Entry = 0xffe90 (c00ffe90) rev = 0 Len = 1
> Pcibios: PCI BIOS entry at 0xf0000+0xb02e
> pnpbios: Found PnP BIOS data at 0xc00fe2d0
> pnpbios: Entry = f0000:e2f4 Rev = 1.0
> Other Bios signatures found:
> APIC: CPU 0 has ACPI ID 1
> Panic : hashinit: bad elements
> KDB: enter : panic
> [thread pid 0 tid 0 }
> Stopped at    kdb_enter+0x2b: nop
> Db>
> 
> 
> On 8/13/09 3:48 PM, "Mike Hibler" <mike@flux.utah.edu> wrote:
> 
> I'll take that as a step in the wrong direction. :-)
> 
> So you put the kernel and the acpi.ko module out in the /tftpboot/blahblah
> directory and reran "prepare"?  Did it blow up first thing?
> 
> We'll get together a Linux-based boot environment that you can install...
> 
> But in the meantime, try one (really two) last thing for me.  Go back to
> the 6.4 kernel.  In the loader.conf.orig file put:
> 
>   hw.hasbrokenint12=1
> 
> at the end. (This enables a workaround for newer machines that cannot use
> the older INT 12h BIOS call to get memory info.)  Rerun the prepare script
> and reboot the node.  At the same time, we are going to turn on a verbose boot.
> When the node reaches the point:
> 
>   Type a key for interactive mode (quick, quick!)
>   Attempting boot of: 155.98.32.70:/tftpboot/freebsd7-sio-acpi
>   Loading /boot/defaults/loader.conf
> 
> as soon as it starts that "Loading" phase, type a space.  It will finish
> loading the modules and then should prompt you with:
> 
>   Type '?' for a list of commands, 'help' for more detailed help.
>   OK
> 
> type "boot -v" and hit return.  The first thing you should see are a bunch
> of messages like:
> 
>   SMAP type=01 base=0000000000000000 len=000000000008d000
>   SMAP type=02 base=00000000000f0000 len=0000000000010000
>   SMAP type=01 base=0000000000100000 len=000000007f4ffc00
>   ...
> 
> and then later:
> 
>   real memory  = 2136993792 (2037 MB)
>   Physical memory chunk(s):
>   0x0000000000001000 - 0x000000000008bfff, 569344 bytes (139 pages)
>   0x0000000000100000 - 0x00000000003fffff, 3145728 bytes (768 pages)
>   0x0000000002c25000 - 0x000000007d1d6fff, 2052792320 bytes (501170 pages)
>   avail memory = 2052218880 (1957 MB)
>   ...
> 
> But I suspect it is not reading the BIOS memory map correctly, so you may
> not see any of this.
> 
> On Thu, Aug 13, 2009 at 02:34:45PM -0400, Espinola, Derek wrote:
> > Using the 7.2 kernel gives a different result.
> >
> > Fatal trap 12: page fault while in kernel mode
> > Cpuid = 0; apic id = 00
> > Fault virtual address = 0xc354efd4
> > Fault code                   = supervisor write, page not present
> > Instruction pointer    = 0x20 :0xc0aee1a3
> > Stack pointer              = 0x28 :0xc3020d48
> > Frame pointer            = 0x28 :0xc3020d5c
> > Code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                                      = DPL 0, pres 1, def32 1, gran 1
> > Processor eflags        = interrupt enabled, resume, IOPL = 0
> > Current process         = 0 ()
> > Trap number              = 12
> > Panic : page fault
> > Cpuid = 0
> >
> >
> > On 8/13/09 1:22 PM, "Mike Hibler" <mike@flux.utah.edu> wrote:
> >
> > I did put together a 7.2 kernel and made sure it worked with the MFSes.
> > Give that a try:
> >
> >   http://www.emulab.net/downloads/tftpboot-kernels-7.2.tar.gz
> >
> > On Thu, Aug 13, 2009 at 12:10:33PM -0400, Espinola, Derek wrote:
> > > Mike,
> > >
> > > I did try 8GB and even down to 4GB of memory, with the same end result. I also tried the 6.4 kernel with the correct acpi.ko using 16GB/8GB, and it now crashes here:
> > >
> > > Real memory = 655360 ( 0MB )
> > > Avail memory = 1154688 ( 0MB )
> > > ACPI APIC Table: <DELL PE_SC3 >
> > > Panic : hashinit : bad elements
> > > Uptime: 1s
> > >
> > > -Derek
> > >
> > > On 8/12/09 7:20 PM, "Mike Hibler" <mike@flux.utah.edu> wrote:
> > >
> > > On Mon, Aug 10, 2009 at 06:01:31PM -0600, Mike Hibler wrote:
> > > > So I was unable to reproduce this.  But our pe1950 only has 8GB of RAM in it.
> > > > If you are feeling daring, you could try to remove half the memory from your
> > > > machine, and we can see if the problem is related.  I doubt it though.
> > > >
> > > > Tomorrow, I will see if I can temporarily snag another 8GB out of the 2950
> > > > we have.
> > > >
> > >
> > > All the memory slots are full, so I could not add more memory.
> > > If you cannot try with 8GB, then we will move on.  We have to fix the
> > > problem one way or the other since I doubt you would want to remove 8GB
> > > from all your machines just so that our old kernel would boot!
> > >
> > > When you tried the new 6.4 kernel, you were using the older ACPI module.
> > > Try extracting both the kernel and acpi.ko from here:
> > >
> > > /usr/testbed/www/downloads/tftpboot-kernels-6.4.tar.gz
> > >
> > > into your /tftpboot/freebsd.newnode/boot directory.  Make sure the ACPI
> > > module gets loaded when the kernel does.  In the meantime, I am going to
> > > see if a 7.2 kernel will boot with the current MFS.
> > >
> > > Ultimately, we will be moving to a Linux-based MFS environment that we are
> > > testing now, so you might get to be the first external test case!
> > >
> >
>