Instead of having to guess the PC where the SP was sampled, always take
both. This allows "seamless" stack decoding for both serial and xflash
dumps, since we don't have to guess which function generated the dump.
Make the core functions (doing the sampling) be ``noinline`` as well,
so that they always have valid frame.
Save SP which is closest to the crash location, which simplifies
debugging. For serial_dump, write SP just before the dump.
For xfdump, save SP in the dump header.
This makes xfdump_dump and xfdump_full_dump_and_reset() equivalent for
stack debugging.