Allwinner A64 Clusterboard Reset Problem: Solved

A bare-metal compute node may soft-lock, spin-lock, deadlock, overheat, or run into resource starvation; the Docker daemon may go away, systemd may become unstable, and so on. In these cases, a watchdog timer acting like a dead man’s switch is not updated (pressed), the timer reaches zero, and the promise is that the watchdog circuit restarts all the hardware like a power-on reset (POR).

Goal: Solve the Pine64 clusterboard A64 SoC reset problem.

My experience is that I cannot get this working out of the box in U-Boot mainline for the Allwinner A64 SoC (sun50i) on a Pine64/SOPINE module in the clusterboard. This problem is more complex than I thought, so if it helps anyone else, I’ll document my thought process and investigation into how I made this work.

Results (so far)

  1. 2xAA RTC batteries in the clusterboard allow a reset.
  2. 2xAA RTC batteries in the powered clusterboard will die within a few months.

Investigation

Here is everything I’ve tried and my thought process while investigating this non-restart issue.

Kindly do not attempt to read everything. This is a detective story with many failures.

This is my investigation into solving a problem that has been unsolved for years. Here are the questions I’ve asked myself.

Is a reset as simple as jumping back to the bootloader code?

No. The CPU cores may be locked up, or have wrong clock signals, and cannot reach a jump to, say, location zero for the CPU to act as if it were just turned on. We’ll need a hardware solution to reset compute modules – a watchdog timer.

Does my System on a Chip (SoC) have a hardware Watchdog Timer (WDT)?

Yes and no. The Allwinner A64 SoC used in the SOPINE (Pine64) modules has a hardware watchdog timer in the processor (A64 PDF schematic, p161), but there is no dedicated watchdog circuit external to the processor (SOPINE PDF schematic).

Allwinner A64 watchdog circuit
How does a Watchdog Timer (WDT) reset the System on a Chip (SoC)?

From the previous diagram, the WDT can send an interrupt (IRQ) or a reset signal (but what does “reset signal” mean?).

Useful information from the Allwinner A64 docs:

  • Timer register base address: 0x01C20C00, offsets in brackets below.
  • WDOG_IRQ_EN_REG (0xA0) – WDOG_IRQ_EN defaults to 0, no IRQs are sent.
  • WDOG_CFG_REG (0xB4) – WDOG_CONFIG defaults to 1 and sends reset signals to the whole system.
  • WDOG_CTRL_REG (0xB0) – Set WDOG_KEY_FIELD to 0xA57 and WDOG_RSTART to 1 to trigger a reset.
  • WDOG_MODE_REG (0xB8) – Set WDOG_INTV_VALUE to 2 for 2s and WDOG_EN to 1 to enable WDT.


Or in code, to reset the SoC:
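A minimal sketch of the register arithmetic above, written in Python purely to show the addresses and values involved. It does not touch hardware. The bit positions (WDOG_KEY_FIELD in bits 12:1, WDOG_RSTART in bit 0, WDOG_INTV_VALUE in bits 7:4, WDOG_EN in bit 0), the write order, and the helper names are assumptions for illustration, not a confirmed reset recipe.

```python
# Sketch only: computes the watchdog register addresses and values described
# above. Nothing here touches hardware; bit positions are assumptions.
TIMER_BASE = 0x01C20C00

WDOG_IRQ_EN_REG = TIMER_BASE + 0xA0
WDOG_CTRL_REG = TIMER_BASE + 0xB0
WDOG_CFG_REG = TIMER_BASE + 0xB4
WDOG_MODE_REG = TIMER_BASE + 0xB8

WDOG_KEY = 0xA57  # key field value from the list above


def ctrl_value(restart=True):
    """Key in bits 12:1 (assumed), restart flag in bit 0 (assumed)."""
    return (WDOG_KEY << 1) | int(restart)


def mode_value(interval_code, enable=True):
    """Interval code in bits 7:4 (assumed), enable flag in bit 0 (assumed)."""
    if not 0 <= interval_code <= 0xF:
        raise ValueError("interval code must fit in 4 bits")
    return (interval_code << 4) | int(enable)


def reset_sequence():
    """(address, value) pairs in one plausible write order (an assumption)."""
    return [
        (WDOG_CFG_REG, 0x1),             # reset signal goes to the whole system
        (WDOG_MODE_REG, mode_value(2)),  # interval code 2 (2 s), watchdog enabled
        (WDOG_CTRL_REG, ctrl_value()),   # key plus restart flag
    ]


if __name__ == "__main__":
    for address, value in reset_sequence():
        print(f"write 0x{value:08X} to 0x{address:08X}")
```

On real hardware each pair would become a 32-bit register write (for example writel() in U-Boot); the sketch only makes the numbers easy to check against the manual.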

The WDT in the A64 has a countdown register with a maximum 16-second watchdog period. When zero is reached, it generates a system-wide “general reset”.

What exactly is a reset signal?

From the Allwinner A64 docs, the USB controller accepts a reset signal as a register flag to enter USB suspension. The Audio Controller (OWA) accepts a register flag to reset that controller. Even the unused Smart Card Reader (SCR) accepts a reset register flag. The CPU has a few reset registers as well. So, setting a bit in various registers around the SoC causes resets in those associated controllers.

More on the CPU reset: it includes core reset, power-on reset, and H_Reset. The last mode resets the whole cluster, i.e. all the cores.

So how do these various reset registers get set? After all, the CPU has gone awry so it cannot set those registers. Let’s look at the A64 bus diagram.

Allwinner A64 advanced peripheral bus

There is something called an Advanced Peripheral Bus (APB) connected to the WDT (timer), which is connected to similar buses.

What is an Advanced Peripheral Bus (APB)?

APB is designed for low bandwidth control accesses, for example, register interfaces on system peripherals. This bus has an address and data phase similar to Advanced High-performance Bus (AHB), but a much reduced, low complexity signal list (for example no bursts). APB is part of Advanced Microcontroller Bus Architecture (AMBA) products licensable from ARM Limited.

From the system diagram above, we see the WDT (timer) puts addresses and data on the APB which crosses a bridge to the AHBs and sets registers in the various controllers – and reset registers are set around the SoC. Now I understand how the WDT resets all peripherals with dedicated buses.

Can we trigger a Power-On Reset (POR) programmatically with the Power Management IC (PMIC)?

There is an AXP803 Power Management IC (PMIC) external to the A64 SoC that has the ability to vary its voltages programmatically. It’s the power rails for the SoC and peripherals. Can this be programmed to power cycle the SoC?

AXP803 PMIC reset signal

By physically grounding the PWROK line, the PMIC shuts off. When the grounding is removed, the PMIC comes back to life as if the device were just plugged in. Then, what toggles the PWROK line if the CPU loses power?

AXP803 serial interface

There is a serial interface. What can this do for us? Short answer: nothing. This is a blind alley. The AXP803 is primarily a Li-ion charging IC and does not have any mode like “shut off and then promise to turn back on”. Some external hardware or at minimum an RC (resistor-capacitor) circuit would be needed to achieve a PMIC reset with the SOPINE’s AXP803 PMIC. Let’s stick with the WDT solution.

Does my SoC have a Brown-Out Detect (BOD) circuit?

The AXP803 PMIC monitors such situations as low power, bad battery, PWRON pin signal, over-temperature, and GPIO input edge signals. When the events occur, the corresponding IRQ status will be set to 1 and will drive the IRQ pin low. It’s up to the host to consume/notice this IRQ. When voltage does drop, the PMIC will lower current until the primary voltage rises.

The AXP803 communicates with the A64 via the Reduced Serial Bus (RSB) and thus the A64 can “notice” incoming power error states. That is a rabbit hole I’ll leave, but to answer this line of thinking, yes, there is action taken on brown-out situations.

What is the difference between PSCI and SYSRESET in U-Boot?

This is where things go off the rails. There are several community patches/hacks to add support for the AXP803 PMIC, adding a sunxi WDT reset via writing directly to registers, and enabling Driver Model (DM) reset classes. Merely experimenting with PSCI, RESET, SYSRESET, and DM_RESET in U-Boot leads to compiler errors like “Error: do_reset() is already defined”, or runtime errors like “System reset not supported on this platform”, or even the board just hangs. What are the main options?

PSCI

Power State Coordination Interface (PSCI) handles CPU and overall system power management, including system shutdown and reset. When CONFIG_PSCI_RESET is enabled (together with CONFIG_ARM_PSCI_FW), a reset request is passed to the PSCI firmware through the PSCI 0.2 interface. Seems simple, and seems complicated. The overwhelming majority of ARM boards have # CONFIG_PSCI is not set in their defconfigs.

SYSRESET

The vast majority of ARM boards have CONFIG_SYSRESET=y in their defconfigs. This seems to be a modular way to reset various components on the SoC programmatically. It has provisions for warm and cold resets, as well as resetting the PMIC (power off then on according to sysreset.h).

SYSRESET_PSCI

To muddy the waters, SYSRESET can itself call into PSCI to do the same thing when CONFIG_SYSRESET_PSCI is enabled, but very, very few defconfigs have this.

SYSRESET_WATCHDOG and SYSRESET_RESETCTL

You can see this is getting confusing. This is where I get off this train and experiment with registers, myself.

Can I trigger a reset by writing to hardware registers?

Having failed to find the right combination of configuration flags over and over again, my new approach is to cause the WDT to fire via a timeout, and eventually via the reset command in the U-Boot shell. My goal is to get the manufacturer’s sample reset code to execute and observe a proper reset by monitoring the A64 via the serial cable.

After disabling PSCI reset with # CONFIG_PSCI_RESET is not set in the defconfig file, let’s examine arch/arm/mach-sunxi/board.c. It has a section with writel instructions from 0x01c20c00. The addressing and bit-twiddling seem fine, actually. When I explicitly try to invoke a sequence of register writes, either nothing happens, or the system hangs when an mdelay() statement is reached, or the board just halts.

Here again is the manufacturer’s recommendation.

Being absolutely explicit with my hex values, here is what I tried in code. This loops forever.

Frustrating.

Am I the only one with this problem?

My initial search for “Pine64 reset” led nowhere (too specific). There are a handful of unanswered pleas for help in the forums, which is why I tried to debug U-Boot on my own.

One day whilst reaching my wit’s end, I instead searched for “A64 watchdog reset” which led me to a deep thread with brilliant people collaborating in the thread titled “H6 Famous Reboot Problem” with nine pages. Allwinner makes the A64 and H6, the latter being very similar to the A64, but with better video support (not needed in a cluster computer). Jackpot.

People even describe the same path I took:

“I’ve tried to debug the reset_cpu() in arch/arm/mach-sunxi/board.c where it set some Watchdog register and loop infinitely, but it seems that watchdog never kicks in.” (ref)

There was a false victory.

“Bingo! The missing thing is CONFIG_NR_DRAM_BANKS=1.”

Could a certain flag not be set?

“Maybe nowayout param should be set to 1? I remember that nowayout=0 on H3 just disables watchdog hardware reset.”

The next idea was looking at the Arm Trusted Firmware (now called Trusted Firmware A, or TF-A).

“Mainline u-boot has a reset command, which triggers a watchdog-based reboot, and it just locks up the machine, when the watchdog timeout expires. The same thing simply happens in the kernel. The kernel tells ATF to reset, ATF does the same thing as u-boot (watchdog-based reset), and the SoC locks up.” (ref)

A sign of hope emerges.

“Changing to R_WDOG instead of WDOG in ATF fixes the issue. … A patch can be added to build/patch/atf/atf-sunxi64/.” (ref)

A consensus emerges that the problem is in the ATF (now called TF-A), and the fix (for H6) is as simple as:


Can the Allwinner H6 TF-A reset solution be applied to the A64?

We’re talking about the trusted watchdog now. Could the solution be as simple as pointing the regular watchdog code to the trusted watchdog? Let’s look at the system bus again.

Allwinner A64 trusted watchdog

It seems that in 2021 TF-A is already using the secure watchdog (SUNXI_R_WDOG) as we can see below. There is nothing to do here for the A64.

How to get U-Boot to call the TF-A trusted-watchdog system reset?

Since the ARM trusted watchdog is, well, trusted (right?), there needs to be communication from user-world to secure-world through the TF-A. Now, the TF-A has the sunxi_system_reset() defined in both sunxi_native_pm.c and sunxi_scpi_pm.c.

We now have to go even deeper to the SCPI, which stands for System Control and Power Interface. Which of the two implementations is used? According to the logic in allwinner-common.mk, the native implementation is used by default. So, how to call this programmatically?

Let’s chase down the “native PSCI ops” structure in sunxi_native_pm.c and see who executes the operation “system_reset”. This led to psci_system_off.c with a method called psci_system_reset(void).

Okay, going deeper, who calls psci_system_reset() then?

Sigh. What’s an SMC?

What is the Secure Memory Controller (SMC) and how to use it to trigger a reset?

The SMC is an Advanced Microcontroller Bus Architecture (AMBA) compliant SoC peripheral. It is an address-space controller with on-chip AMBA bus interfaces. The user guide gets wordy, but let’s say it’s a gatekeeper to protected address space that the TF-A secure code uses.

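Back to U-Boot, we see that in the DTS for the A64, PSCI uses SMC. In this context, SMC is the Secure Monitor Call instruction, which traps from the normal world into the secure monitor (TF-A), rather than a memory controller.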

We’ve now come full-circle back to PSCI in U-Boot. Let’s drop to the U-Boot shell and try to issue some SMC commands manually to see if it even works. Add CONFIG_CMD_SMC=y first.

No obvious docs. No SMC examples. Just treading water in the deep end of the pool. What even is a Function ID? In U-Boot mainline, I found a lonely file called durian.c and saw a hint:

Looking at the command processor for smc, it also arrives at arm_smccc_smc() via a method called do_call() in smccc-call.c, and again in a method named invoke_psci_fn().

Chasing down the latter, I found invoke_psci_fn(PSCI_0_2_FN_SYSTEM_RESET, 0, 0, 0) deep in code. Then PSCI_0_2_FN_SYSTEM_RESET is defined as PSCI_0_2_FN(9). We eventually arrive at:
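A minimal sketch of that derivation, assuming PSCI_0_2_FN(n) is simply a base of 0x84000000 plus the function number, and assuming the field layout of the Arm SMC Calling Convention (bit 31 fast call, bit 30 64-bit convention, bits 29:24 owning service, bits 15:0 function number). The helper names are hypothetical.

```python
# Sketch only: rebuilds the PSCI 0.2 SYSTEM_RESET function ID and decodes it
# using the SMC Calling Convention bit layout (an assumption, see lead-in).
PSCI_0_2_FN_BASE = 0x84000000


def psci_0_2_fn(n):
    """Mirror of the PSCI_0_2_FN(n) macro: base plus function number."""
    return PSCI_0_2_FN_BASE + n


def decode_function_id(fid):
    """Split a 32-bit function ID into its calling-convention fields."""
    return {
        "fast_call": bool((fid >> 31) & 0x1),  # bit 31
        "smc64": bool((fid >> 30) & 0x1),      # bit 30
        "owner": (fid >> 24) & 0x3F,           # bits 29:24 (4 = standard service)
        "function": fid & 0xFFFF,              # bits 15:0
    }


PSCI_0_2_FN_SYSTEM_RESET = psci_0_2_fn(9)

if __name__ == "__main__":
    print(hex(PSCI_0_2_FN_SYSTEM_RESET))
    print(decode_function_id(PSCI_0_2_FN_SYSTEM_RESET))
```

Read this way, the ID is a 32-bit fast call to the standard secure service, function number 9.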

Happily, we find that the Function ID for reset is also 0x84000009. Let’s trigger a system reset via TF-A using the smc command in U-Boot.

Absolutely nothing happened, except the board still hangs. Back to square one. This seemed like a hack, anyway. Moving on.

Is there a pointer math error?

I took a deep dive into the Allwinner A64 user manual again, and looked at the WDT register offsets in a C struct. Are the struct offsets correct? I see u32 (4-byte) entries, so does u32 ctl (WDOG_CTRL_REG) truly start at the 0x10 offset, or incorrectly at 0x0C (4 bytes * 3 = 12)?

Wouldn’t that be nice if this was a simple pointer error? Let’s see with a quick test.
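One way to check the offsets without touching the board is to mirror the struct with ctypes, as a rough sketch. The field names and the two reserved words below are assumptions modelled on a block of u32 registers starting at the watchdog's 0xA0 offset from the timer base; they are not copied from any particular source file.

```python
# Sketch only: a ctypes mirror of a watchdog register block made of u32
# fields. The field names and the two reserved words are assumptions.
import ctypes


class SunxiWdog(ctypes.Structure):
    _fields_ = [
        ("irq_en", ctypes.c_uint32),    # 0x00
        ("irq_sta", ctypes.c_uint32),   # 0x04
        ("res1", ctypes.c_uint32 * 2),  # 0x08 and 0x0C (reserved)
        ("ctl", ctypes.c_uint32),       # 0x10
        ("cfg", ctypes.c_uint32),       # 0x14
        ("mode", ctypes.c_uint32),      # 0x18
    ]


WDOG_BLOCK_OFFSET = 0xA0  # watchdog block offset from the timer base

if __name__ == "__main__":
    for name, _ in SunxiWdog._fields_:
        offset = getattr(SunxiWdog, name).offset
        absolute = WDOG_BLOCK_OFFSET + offset
        print(f"{name:8s} struct +0x{offset:02X}  timer base +0x{absolute:02X}")
```

With two reserved words after the first two registers, ctl lands at struct offset 0x10, which is timer base + 0xB0, matching the WDOG_CTRL_REG offset listed earlier.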

So, win, the 0x10 offset is correct. But, the pointers are all wrong.

Even with the correct register addresses, the board still does not reset.

Does the watchdog physically even work in the first place?

I found a thread suggesting that the Allwinner hardware may be broken.

The issue is real except on Pine H64 and Rongpin RP-H6B which seems to be NOT affected. Lot of users on OrangePi boards (Lite2 / One Plus and 3) are complaining about this issue.

and

We perform a simple watchdog test on different boards:

Pine H64 = H6 V200-AWIN H6448BA 7782 => OK
OrangePi Lite 2 = H6 V200-AWIN H8068BA 61C2 => KO
PineH64 = H8069BA 6892 => OK
Orange Pi 3 = HA047BA 69W2 => KO
OPiOnePlus = H7310BA 6842 => KO
OPiLite2 = H6448BA 6662 => KO
Beelink GS1 = H6 V200-AWIN H7309BA 6842 => KO

The community consensus for sunxi hardware, again, is to use the trusted watchdog (R_WDOG). But, how?

In the 705-page Allwinner A64 user guide, it only has one solitary reference to an R_WDOG register at 0x01F01000 on page 74. That, and a mention of R_WDOG being a secure module under the “CPUS” domain. There is no reference on how to use the module or what register offsets it uses.

Let’s see if I can add the trusted watchdog to the A64 device tree with a graft from the H6 device tree.

No effect. Truthfully, I’m not even sure if the GIC_SPI 103 grafted to the A64 does anything. I’ll leave this alone and try something else.

Do I have a core with an old revision that might be defective?

Let’s find the A64 revision number using a function in assembly we cannot normally access.

My board is on revision 4, and that is the latest revision. Good, good.

Let’s make sure all the processor errata are enabled in TF-A to be safe (in a Dockerfile).

Let’s keep going and try something else.

As a PoC, can I hack TF-A to enter an infinite reset loop via secure watchdog?

Let’s see if instead of loading U-Boot SPL after TF-A, can I directly invoke the secure watchdog reset code that I tried earlier?

Results: the system hangs.

Let’s dig a little deeper with an experiment in timing:

Here are several timing results:

This means the CPU keeps running after the watchdog timer starts, then eventually halts, but still no restart. Running the test a dozen more times and eyeballing the average shows that the CPU runs for about two seconds before halting. This coincides with setting TIMER_REG + 0xB8 to (1 << 5) which is a two-second watchdog period. This at least means the regular watchdog fires on time.
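To double-check that (1 << 5) really means two seconds, here is a small decoding sketch. The interval table and the bit positions (interval code in bits 7:4, enable in bit 0) are assumptions based on the register description earlier; the table tops out at the 16-second maximum mentioned before. The function name is hypothetical.

```python
# Sketch only: decodes a WDOG_MODE_REG value. The interval table and the
# bit positions are assumptions, not verified against silicon.
INTERVAL_SECONDS = {
    0x0: 0.5, 0x1: 1, 0x2: 2, 0x3: 3, 0x4: 4, 0x5: 5,
    0x6: 6, 0x7: 8, 0x8: 10, 0x9: 12, 0xA: 14, 0xB: 16,
}


def decode_mode(value):
    """Return the interval (bits 7:4) and the enable flag (bit 0)."""
    code = (value >> 4) & 0xF
    return {
        "interval_code": code,
        "interval_s": INTERVAL_SECONDS.get(code),  # None if not in the table
        "enabled": bool(value & 0x1),
    }


if __name__ == "__main__":
    print(decode_mode(1 << 5))
    print(decode_mode((1 << 5) | 1))
```

Under this layout, (1 << 5) selects interval code 2, i.e. two seconds, which agrees with the observed timing. The enable flag is a separate bit, so a value that both selects two seconds and enables the watchdog would be (1 << 5) | 1.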

Changing TIMER_REG from 0x01C20C00 (WDT) to 0x01f01000 (R_WDT) results in the same behaviour. Then, why doesn’t the system restart?

On reset, does the instruction pointer jump past the BL1 (Boot ROM) code into nothingness?

The watchdog times out and the CPU(s) halts. What would make the CPU halt? Plausibly, if it lost power, or a jump instruction landed in a NOP slide to the end of memory. Just to be safe, I’ll explicitly set the reset vector for each core. Note: Allwinner isn’t clear if CPUCFG starts at 0x01700c00 or 0x01700000, so I tried both.

No new effect.

Can we even write to the reset vectors?

Yes, we can. From the experiment below, 0x01700000 + 0xa0 is writeable from 31:2.

Having experimented with writing several jump points like 0x04000000, 0x80000000, 0x80010000, and of course 0x00000000, I’m still no closer to solving this.
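The "writeable from 31:2" observation can be expressed as a mask, sketched below. It assumes the two low bits simply read back as zero, which would mean any reset address has to be 4-byte aligned; the helper names are hypothetical.

```python
# Sketch only: models a register whose bits 31:2 are writable and whose
# bits 1:0 read back as zero (an assumption about the low bits).
def bit_range_mask(high, low):
    """Mask with bits high..low (inclusive) set."""
    return ((1 << (high - low + 1)) - 1) << low


RESET_VECTOR_WRITABLE_MASK = bit_range_mask(31, 2)


def read_back(written):
    """Value the register would hold after writing `written`."""
    return written & RESET_VECTOR_WRITABLE_MASK


if __name__ == "__main__":
    for candidate in (0x00000000, 0x04000000, 0x80000000, 0x80010000):
        print(f"0x{candidate:08X} -> 0x{read_back(candidate):08X}")
```

All four jump points tried above are 4-byte aligned, so under this model they would be stored unchanged; the mask is not what stops the reset from working.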

Is this a clusterboard problem only, or does reset work on the baseboard?

Reset works in the baseboard (thanks, Dave).

Reset works in the SOPINE baseboard

The exact same SD card with the hacked TF-A code enters an infinite reset loop on the baseboard. See below.

Why does this work? The power chip (PMIC) is on the SOPINE module. The SoC has the Advanced Peripheral Bus (APB), which carries the reset signal, internal to the A64 die. The only notable external components on the clusterboard are the RTL8370N Ethernet switch IC and the RTL8211E Ethernet port chip per SOPINE. The baseboard also has the RTL8211E Ethernet port chip.

What happens if USB and Ethernet devices are disabled in the device tree to remove external interference factors?

In this experiment, I’ve disabled USB and Ethernet in the device tree in case they somehow hold up the reset procedure.

No restart observed. The SoC still hangs.

Is there a way to interrupt the reset process? Can a peripheral or module prevent a restart?

The reset signal (special register writes via the APB bus) is sent to all the SoC modules, not just the CPU. This probably isn’t a parallel operation because if all peripherals are on the same bus, then they need addressing, and only one address can exist on the bus at a time, so the reset signal must be a synchronous process. Can this synchronous reset queue be held up somehow? That is what I was trying to rule out by disabling the USB and Ethernet previously.

Let’s dive into the ARM SoC watchdog module SP805 for some context.

Skimming over the details, the most important information I could tease out of the watchdog docs is that it requires two clocks – one to drive the watchdog counter, and the other to drive the APB bus. Could the APB bus clock have stopped somehow on the clusterboard but not the baseboard? Let’s come back to this later.

What other hardware differences are there between the baseboard and the clusterboard?

There is one other difference between the baseboard and clusterboard: powering the real-time clock (RTC). Let’s compare.

SOPINE baseboard’s RTC is powered by the rails (source)
SOPINE clusterboard uses mandatory AA batteries for RTC (source)

Did Pine64 revise the SOPINE module to sideline the 3.0V VCC-RTC from the power IC (PMIC) so only a physical battery can power the RTC? The schematics are in flux, so I’ll leave it to the experts to decide:

SOPINE module doesn’t use 3.0V PMIC RTC anymore?

Just to be more convincing, here is the Pine64 LTS schematic:

Pine64 LTS PMIC VCC_RTC is unused

Why should this matter? Isn’t the RTC optional and used to keep the date and time roughly accurate while the SoC is powered off? Let’s pull on this thread a bit since a powered RTC seemingly allows the SoC to reset.

How important is the real-time clock (RTC) to the A64 SoC? Isn’t it supposed to be optional?

From Allwinner,

The real-time clock (RTC) is for calendar usage … The unit can be operated by the backup battery while the system power is off. … The alarm generates an alarm signal at a specified time in the power-off mode or normal operation mode. In normal operation mode, both the alarm interrupt and the power management wakeup are activated. (source)

It seems the RTC has an alarm that is capable of waking up power management. This could be what restarts the CPU. This could be the RTCINTR signal in the functional block diagram below from ARM.

ARM RTC block operation

Let’s ask some more questions.

Does the A64 VCC-RTC pin power everything in the RTC block?

If you are like me, you probably haven’t wondered why some SoC modules are prefixed with “R_”.

A64 RTC and R-modules

From my research,

The AR100, also called the CPUS or ARISC in SoC documentation is a coprocessor present in the A31 and newer sunxi SoCs. While the name “AR100” refers only to the OpenRISC CPU core, the processor is tightly integrated with other “RTC block” hardware. In general, any device whose name begins with “R_” is intended to be controlled by the AR100. This includes the R_PIO, R_PRCM, and several timers. (source)

From the A64 power docs, there is a VDD-CPUS pin to power the above. It’s safe to say that the A64 VCC-RTC pin powers only the RTC. So, no RTC power, no RTC alarm?

Could the clusterboard WDT reset solution be as simple as adding 2xAA batteries to the clusterboard to power the RTC?

Holy smokes, the clusterboard resets!

Powering the real-time clock (RTC) in the A64 SoCs (strangely) allows a reset.
Clusterboard 2xAA batteries to power the RTC

Barring any other discoveries, the current hypothesis is that the RTC needs power, and the only way to achieve that on the clusterboard is with batteries.

Hang on. The RTC runs without batteries; batteries keep it going when the main power is off.

Let’s examine some RTC registers without and with external RTC (battery) power. Here is the test code.

First, without batteries.

Now, with batteries to power the RTC. This is the same for the SOPINE baseboard.

The RTC seconds counter increments with and without batteries, as expected, but the alarm registers are all empty in both cases. I suspect the RTC is a red herring.

Let’s examine the schematics again. Some pins are pulled high by the VCC-RTC line.

In several ARM SoC designs, the external non-maskable interrupt (AP-NMI#) pin and AP-RESET#1 pin are pulled high by the VCC-RTC. For example:

Example: A31 NMI pulled high by VCC-RTC

What is unique about the Pine64 designs is that the power IC (PMIC) does not feed the VCC-RTC line, whereas the majority of other SoC board designs have the dedicated, regulated PMIC VCC_RTC output feeding the RTC along with the battery via the VCC-RTC line as it is in the always-on power domain.

Pine64 LTS PMIC VCC_RTC is unused


AXP803 power-management IC (PMIC) facts (ref):

  1. AXP803’s PWROK pin is pulled up to RTCLDO (outputs to VCC_RTC) internally.
  2. RTCLDO is always on, even during power down or reset.
  3. RTCLDO is powered by IPSOUT and feeds from ACIN/VBUS or BAT.
  4. PWROK is tied to AP-RESET# on the A64 SoC.
  5. PWROK stands for Power-On Key, not “Power OK”.
  6. When PMIC is shut down, VCC_RTC will be shut off for two seconds and pulled to GND via 1kΩ.
  7. The IRQ pin needs a 10kΩ pull-high (usually to VCC-RTC) as it is NMOS open-drain.

Inferences:

  1. Without VCC-RTC to pull the PMIC’s IRQ pin high, IRQ floats or is grounded. Without the battery, does the PMIC fail to signal an interrupt (AP-NMI#) to the SoC (possibly missing a wake-up signal)?
  2. AP-RESET# is pulled high internally in the PMIC via RTCLDO, so the battery doesn’t affect this logic.

My new working hypothesis is that the NMI is never properly asserted without a battery on the clusterboard.

How important is the NMI pin to a reset on the SoC?

Interrupts are complex, so I’ll itemize some facts I’ve learned about the NMI pin.

SoC interrupt facts (ref):

  1. Allwinner sunxi SoCs (A31 and newer, which includes the sun50i A64) have two interrupt controllers: GIC and R_INTC.
  2. GIC does not support wakeup and is inaccessible from the ARISC (power CPU).
  3. All IRQs that can be used to wake up the system are routed through R_INTC.
  4. All wake IRQs are enabled during suspend.
  5. R_INTC controls the NMI pin, the trigger, and mask for the NMI input pin.
  6. R_INTC provides the interrupt input for the ARISC coprocessor.
  7. R_INTC is in the always-on power domain.
  8. NMI pin is routed to the “IRQ 0” input on R_INTC.
  9. NMI trigger type is controlled by the NMI_CTRL_REG.
  10. SCP firmware = Crust = power management firmware.
  11. During suspend, the Crust will enable the interrupt input to the AR100.
  12. AR100 will treat any IRQ (subject to a mask) as a trigger to wake up.
  13. AR100 = CPUS = ARISC.

The NMI pin is the second-highest-priority interrupt (IRQ), second only to the RESET interrupt. When the SoC is reset or suspended, the NMI can easily trigger a wake-up and/or reinitialization of the BROM, CPUs, peripherals, and so on. Additionally, the PMIC IRQ pin is asserted on thermal problems, rechargeable battery removal/insertion, power drop, and other programmable situations.

Is VCC-RTC getting power on the SOPINE baseboard but not the clusterboard?

Here is the SOPINE baseboard schematic. Below is the PCB trace of the baseboard just for fun.

SOPINE baseboard PCB traces

On my baseboard, diode OD4 is missing (which is good because VCC-RTC and BAT-RTC are shorted through a 0Ω resistor), so VCC-RTC is seemingly only powered by a battery. Let’s put a multimeter on VCC-RTC and see if it is powered.

Baseboard electrical measurements:

  1. SOPINE removed, power on, VCC-RTC is 0V.
  2. SOPINE removed, power off, VCC-RTC resistance is infinite.
  3. SOPINE inserted, power on, VCC-RTC is 2.78V.
  4. SOPINE inserted, power off, VCC-RTC resistance increases from ~1.5MΩ (settles on 3.25MΩ).
  5. SOPINE inserted, power on, 1kΩ series resistor, VCC-RTC draws ~2.7mA.
  6. SOPINE inserted, power off, VCC-RTC capacitance is 0.96uF.

Clusterboard (v2.3) electrical measurements:

  1. SOPINEs removed, power on, VCC-RTC is 0V.
  2. SOPINEs removed, power off, VCC-RTC resistance is infinite.
  3. 7xSOPINE inserted, power on, VCC-RTC is 2.78V (across battery holder).
  4. 1xSOPINE inserted, power off, VCC-RTC resistance increases from ~700kΩ (settles on 3.26MΩ).
  5. 7xSOPINE inserted, power off, VCC-RTC resistance increases from ~140kΩ (settles on 240kΩ).
  6. 7xSOPINE inserted, power on, 1kΩ series resistor, VCC-RTC draws ~2.7mA.

Clusterboard (v2.3) VCC-RTC capacitance measurements:

  1. 1xSOPINE inserted, power off, VCC-RTC non-convergent capacitance.
  2. 2xSOPINE inserted, power off, VCC-RTC capacitance is 1.9uF.
  3. 3xSOPINE inserted, power off, VCC-RTC capacitance is 3.1uF.
  4. 4xSOPINE inserted, power off, VCC-RTC capacitance is 4.7uF.
  5. 5xSOPINE inserted, power off, VCC-RTC capacitance is 6.6uF.
  6. 6xSOPINE inserted, power off, VCC-RTC capacitance is 8.9uF.
  7. 7xSOPINE inserted, power off, VCC-RTC non-convergent capacitance.

Clusterboard (v2.3) battery measurements:

  1. 2xAA new lithium batteries voltage is 3.6V.
  2. 2xAA, no SOPINEs inserted, power off, current draw is 0.00mA.
  3. 2xAA, 1xSOPINE inserted, power off, current draw is 0.07mA.
  4. 2xAA, 3xSOPINE inserted, power off, current draw is 0.19mA.
  5. 2xAA, 7xSOPINE inserted, power off, current draw is 0.45mA.
  6. 2xAA, 1xSOPINE inserted, power on, current draw is 0.43mA, resets.
  7. 2xAA, 2xSOPINE inserted, power on, current draw is 0.86mA, resets.
  8. 2xAA, 3xSOPINE inserted, power on, current draw is 1.29mA, resets.
  9. 2xAA, 7xSOPINE inserted, power on, current draw is 3.01mA, resets.
  10. 1xAA, 1xSOPINE inserted, power on, current draw is 160mA, no reset.
SOPINE VCC-RTC 1.1uF capacitor bank

Observations:

  1. If VCC-RTC were connected to the PMIC’s VCC_RTC, then the 10uF (C70) would be in parallel, and the single SOPINE capacitance would be 11.1uF, not 1.1uF.
  2. A single SOPINE also doesn’t restart without battery power, same as seven SOPINEs, in the clusterboard.
  3. A clusterboard SOPINE has a mathematical VCC-RTC resistance of 1.68MΩ.
  4. 2xAA lithium batteries with 7000mAh will last 1.77 years at 0.45mA RTC draw (power off).
  5. 2xAA lithium batteries with 7000mAh will last only 97 days at 3.01mA RTC draw (power on).
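The battery-life figures in the observations can be reproduced with a few lines, as a rough sketch. The 7000mAh capacity is the figure used above; if the two cells are wired in series, the pack capacity would be closer to a single cell's rating, so it is left as a parameter. The function names are hypothetical.

```python
# Sketch only: battery life from capacity and a constant current draw.
# Ignores self-discharge and voltage sag; capacity is the post's assumption.
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def battery_life_days(capacity_mah, draw_ma):
    """Days until the stated capacity is used up at a constant draw."""
    return capacity_mah / draw_ma / HOURS_PER_DAY


def battery_life_years(capacity_mah, draw_ma):
    return battery_life_days(capacity_mah, draw_ma) / DAYS_PER_YEAR


if __name__ == "__main__":
    capacity = 7000  # mAh, figure used in the observations above
    print(f"power off, 0.45mA: {battery_life_years(capacity, 0.45):.2f} years")
    print(f"power on,  3.01mA: {battery_life_days(capacity, 3.01):.0f} days")
    print(f"per-SOPINE draw, power on: {3.01 / 7:.2f}mA")
```

The per-SOPINE figure of about 0.43mA lines up with the single-SOPINE, power-on measurement in the battery list.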
Why is the mathematical VCC-RTC resistance of a clusterboard SOPINE 1.68MΩ when a measured resistance is 3.26MΩ?

Parallel resistance

Given the measured 240kΩ VCC-RTC resistance across seven SOPINEs, and assuming all seven are identical, each one must have a resistance of 7 × 240kΩ = 1.68MΩ, but one was measured at 3.25MΩ. That is suspicious. Let’s measure the resistance across each SOPINE individually.
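A small sketch of the parallel-resistance arithmetic, assuming ideal resistors and that each SOPINE presents one resistance between VCC-RTC and ground. The function names are hypothetical.

```python
# Sketch only: ideal parallel-resistance helpers for the VCC-RTC estimate.
def parallel_resistance(resistances):
    """Combined resistance of resistors wired in parallel (ohms)."""
    return 1.0 / sum(1.0 / r for r in resistances)


def equal_branch_resistance(total, count):
    """Per-branch resistance if `count` identical branches give `total`."""
    return total * count


if __name__ == "__main__":
    per_module = equal_branch_resistance(240e3, 7)
    print(f"per-SOPINE estimate: {per_module / 1e6:.2f}MΩ")
    print(f"seven in parallel:   {parallel_resistance([per_module] * 7) / 1e3:.0f}kΩ")
```

The same parallel_resistance() helper can be fed the individually measured values to see how close the combined figure comes to the 240kΩ reading.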

Settled VCC-RTC resistance per SOPINE:

Measured VCC-RTC resistances across SOPINEs

The parallel resistance is 278kΩ which is reasonably close to 240kΩ observed. We’ve learned the SOPINEs have different internal VCC-RTC resistances.

Do the different SOPINE VCC-RTC resistances affect the restart?

No. Both the 3.26MΩ SOPINE and the 1.19MΩ SOPINE fail to restart in the clusterboard, but both restart just fine in the baseboard. The problem likely isn’t related to a silicon defect.

Could the RTC’s external 32.768 kHz crystal not be active with no VCC-RTC?

No. Measured with an oscilloscope, the 32.768 kHz crystal (found just below the word “Designed” on the back of the SOPINE) outputs a perfect 32.768 kHz sine wave without batteries on the clusterboard. My hope was that somehow the xtal was unpowered so the RTC alarm wouldn’t activate.

Why do both the baseboard and the un-batteried clusterboard have 2.78V on the VCC-RTC line?

It’s possible that power is supplied internally by the A64 SoC in the absence of an external, dedicated VCC-RTC supply from batteries. I’m not able to find detailed power diagrams for the A64 SoC, but from a design point of view, it makes sense that the VCC-RTC pin is not electrically isolated while the RTC is on main power.

Could there be a 0.22V drop on the VCC-RTC line through a diode from 3.0V from the PMIC? No. I’ve established that the PMIC isn’t powering the VCC-RTC line. Also, the Schottky diode that was “deleted” from the schematics has a 0.49V drop which is too high.

Is the baseboard under-voltaged?

No. The clusterboard’s power adapter outputs 5.16V, while the baseboard’s power adapter outputs 5.36V. This doesn’t matter because 5.16V is far above the 3.0V that the PMIC’s low-dropout regulator would supply to the RTC (which isn’t connected, anyway). I used an external PSU to reach 5.36V to power the clusterboard just to cover this unlikely possibility. External power can be excluded as a restart culprit.

How can I prevent a WDT restart in the baseboard?

If I can prevent a restart in the baseboard somehow, it may help understand how the wakeup process happens after a WDT reset.

However, when I cripple the device tree and the PMIC regulator initialization code in TF-A, baseboard WDT restarts still take place. I have not been able to prevent WDT restarts in the baseboard.

Crippled PMIC regulators in TF-A code:

Crippled device tree:

Even with disabled nodes and disabled interrupt controllers, WDT resets still take place on the baseboard.

No effect. Let’s disable that “ARM GICv2 driver” in TF-A code next.

No effect. Let’s remove more TF-A code and see what happens. I’ve removed the security setup and even PMIC initialization.

No effect. Let’s obliterate all DTB loading code in the TF-A next.

No effect. Let’s initiate a WDT reset as the very first thing TF-A does – a Hail Mary pass. Here is the code and results.

No effect. Literally, the first action TF-A takes is to initiate a WDT reset, and it succeeds. Without DTB initialization, without GIC interrupts setup, and without any register writes at all except the WDT reset, a reset takes place. How can the baseboard WDT reset be disabled?

Does the RX UART0 pin on the baseboard being pulled high trigger a wakeup?

Another hardware difference between the clusterboard and baseboard is the presence of a small protection circuit to prevent the RX pin of UART0 from supplying power to the SoC while it is turned off.

RX pin pulled high on the baseboard only

The clusterboard has no such circuit. As an experiment, I connected the clusterboard’s Z1 regulator’s input (DCDC1) to a 10kΩ resistor and then to the PB9_A pin of J4. No effect. No reset was observed.

With no battery, there should be 0V across the OD3 Schottky diode, right?

Previously, I measured 2.78V across the baseboard’s battery jack in the diagram below.

RTC battery jack on baseboard

This perplexed me earlier and sent me off in another direction because the un-batteried RTC voltage is the same in both the clusterboard and baseboard. According to the schematics, with no battery and hence no electrical connection to the OD3 Schottky diode, there should be 0V across the diode.

Phantom diode voltage should be 0V

That is not 0V. Moreover, when I walk away and come back, the voltage across the diode changes.

Phantom diode voltage changes

Here is a circuit representation of the diode having an EMF across it.

RTC OD3 diode voltage schematic

What is really curious is that my body’s capacitance affects the voltage readings across the RTC. When my leg moves closer to the multimeter’s probe cables, the voltage changes.

My background is in discrete electronics, so forgive me if the EM noise and stray capacitance were obvious to the reader. I pulled on this thread for just a bit.

Coming back to that 2.78V, the RTC voltage across the battery jack changes wildly depending on how far or close my body is to the baseboard like a theremin, or if I touch the positive line. Is the baseboard that sensitive?

Here is the same as above, but with a simple oscilloscope on the VCC-RTC battery line.

This waveform is interesting, actually.

VCC-RTC ripple waveform

Does this waveform come from the 5V wall adapter?

With the oscilloscope, the DC-out of the wall adapter is a clean 5.3V with no latent ripple. You could level a picture frame by how flat the output is. This is surprising because the frequency of the waveform is approximately 60 Hz. Even though the voltage is clean, we shouldn’t discount the ground loop.

Why does this waveform look familiar?

At first blush, it looks like a square wave is fed to a high-pass RC filter as explained in this StackExchange thread.

RC differentiator waveform (credit)
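
To see why a square wave through a high-pass RC filter produces this spiky shape, here is a minimal discrete-time sketch of a first-order RC high-pass filter. The 60 Hz frequency comes from the observation above; the resistor, capacitor, and sample rate are illustrative assumptions, not values measured on the SOPINE.

```python
def square_wave(freq_hz, sample_rate_hz, n_samples, amplitude=1.0):
    """Return samples of a square wave that alternates between amplitude and 0."""
    period = sample_rate_hz / freq_hz  # samples per cycle
    samples = []
    for n in range(n_samples):
        phase = (n % period) / period
        samples.append(amplitude if phase < 0.5 else 0.0)
    return samples


def rc_high_pass(x, r_ohms, c_farads, sample_rate_hz):
    """First-order RC high-pass filter (a differentiator for slow inputs)."""
    dt = 1.0 / sample_rate_hz
    rc = r_ohms * c_farads
    alpha = rc / (rc + dt)
    y = [0.0]  # assume the capacitor has already settled on the first sample
    for n in range(1, len(x)):
        # Only changes in the input pass through; the output then decays.
        y.append(alpha * (y[-1] + x[n] - x[n - 1]))
    return y


# Assumed values for illustration only: R = 10 kOhm, C = 100 nF (tau = 1 ms).
SAMPLE_RATE_HZ = 60_000
wave = square_wave(60.0, SAMPLE_RATE_HZ, 2000)
output = rc_high_pass(wave, 10e3, 100e-9, SAMPLE_RATE_HZ)

print(f"largest positive spike: {max(output):.3f}")
print(f"largest negative spike: {min(output):.3f}")
print(f"level just before the next edge: {output[999]:.5f}")
```

Each edge of the square wave shows up as a sharp spike that decays back toward zero, which is the sawtooth-like shape seen on the scope. A longer time constant would make the decay slower and the waveform look closer to the original square wave.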

There is no 60 Hz PLL on the SOPINE. Any boost converters would operate outside human hearing well past 20 kHz. The two external clocks are 24 MHz and 32.7 kHz. The latter may cause a harmonic which appears to be around 60 Hz. Or, it may just be me and my human capacitive-antenna absorbing transient EM signals from the lights and mains cabling.

Does this waveform manifest when the baseboard is on 5V batteries?

No. Operating on batteries in a minimal-EM environment results in the disappearance of the above waveform. As Dave shared with me, my body is likely operating as an antenna for surrounding mains EM with the wall adapter facilitating the unwanted circuit.

Powered from a 5V USB battery, 2.78V is consistently observed without stray human capacitance.

As an aside, just measuring the waveform across the RTC battery terminals with an oscilloscope drops the voltage reading from 2.78V to 0.3V. This oscilloscope has capacitance and resistance, and it looks like it is preventing an RC circuit from charging a baseboard capacitor. When the scope is removed, a baseboard capacitor slowly charges as seen below.
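
As a rough check on how heavily the scope loads this node, the two readings can be treated as a simple resistive voltage divider. This is only a sketch: the 1 MΩ scope input resistance is an assumption (a common value for a 1x probe, not a measured one), the multimeter reading is treated as unloaded, and the capacitance mentioned above is ignored.

```python
def loaded_voltage(v_source, r_source_ohms, r_load_ohms):
    """Voltage seen by a load hanging off a source with series resistance."""
    return v_source * r_load_ohms / (r_source_ohms + r_load_ohms)


def implied_source_resistance(v_open, v_loaded, r_load_ohms):
    """Solve the divider equation for the source's series resistance."""
    return r_load_ohms * (v_open / v_loaded - 1.0)


SCOPE_INPUT_OHMS = 1e6  # assumption: typical 1x probe input resistance

# 2.78V with the multimeter, 0.3V with the scope attached (from the text).
r_src = implied_source_resistance(2.78, 0.3, SCOPE_INPUT_OHMS)
print(f"implied source resistance: {r_src / 1e6:.2f} MOhm")
print(f"check: {loaded_voltage(2.78, r_src, SCOPE_INPUT_OHMS):.2f} V under load")
```

Under these assumptions the node behaves like a source with several megaohms in series, which fits a very weak feed that any measuring instrument can drag down.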

Let’s explore something else and come back to that diode voltage.

Let’s measure the baseboard OD3 diode voltage again under 5V battery power.

This time on 5V batteries, there is no OD3 diode voltage anymore. It is gone. It was an artifact of stray capacitance and ambient EM radiation passing through a rectifier.

We can, however, bypass the diode and see how the VCC-RTC line behaves. When I short the OD3 diode, the oscilloscope no longer interferes with the VCC-RTC voltage and we can see about 2.78V.

The clusterboard has no such rectifier and shows a nominal scope output across VCC-RTC.

Clusterboard VCC-RTC voltage under battery power

Is the conclusion that the VCC-RTC is also unpowered on the baseboard, there is no leak or PCB error, yet an externally-powered RTC is not what wakes up the baseboard?

The 2.78V is still a mystery. There is no documentation on it. That needs to be resolved.

Unable to give up, let’s turn to the powered microscope and physically map the VCC-RTC line on the SOPINE module.

SOPINE PCB top near the PMIC and VCC_RTC pin

Hello. Where is this little hole taking the electron flow? There should be no other components on the PMIC’s VCC_RTC line besides that 10uF capacitor. There was an OD4 diode, but it is deleted from the schematics.

OD4 is labeled as deleted

Let’s see where this hole goes.

SOPINE PCB bottom near the SD card slot

The diode exists. In fact, it is the only Schottky diode in the schematic. It is stamped “SS”, short for “SS14”, which is an AliExpress-found equivalent for the XBS104S14 the schematic called (calls?) for. It’s not supposed to exist. Could this be powering the VCC-RTC line? Let’s see.

The VCC-RTC line is powered by the PMIC. In fact, placing the probe right on pin 49 of the PMIC (VCC_RTC) shows 3.00V. Placing the probe on the diode also shows 3.00V on one side and 2.80V on the other side.

Wait. The voltage drop across the diode is 0.2V? That doesn’t match the typical forward voltage drop of 0.49V in the specs. It turns out in the graphs, under a very low current the forward voltage drop can be around 0.2V.

XBS104S14 forward voltage response curves
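
The datasheet curves are the authority here, but the ideal diode equation gives a feel for how much lower the current must be to explain the smaller drop. This sketch assumes an ideal diode (ideality factor 1, 300 K, no series resistance), none of which is taken from the XBS104S14 datasheet.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23
ELECTRON_CHARGE_C = 1.602176634e-19


def thermal_voltage(temp_kelvin=300.0):
    """kT/q, about 26 mV near room temperature."""
    return BOLTZMANN_J_PER_K * temp_kelvin / ELECTRON_CHARGE_C


def decades_of_current(delta_vf, ideality=1.0, temp_kelvin=300.0):
    """How many factors of ten in current correspond to a change in forward voltage."""
    volts_per_decade = ideality * thermal_voltage(temp_kelvin) * math.log(10)
    return delta_vf / volts_per_decade


# 0.49V typical drop in the specs versus the 0.2V measured on the board.
decades = decades_of_current(0.49 - 0.20)
print(f"about {decades:.1f} decades less current than the spec test point")
```

In this idealized model the forward voltage falls by roughly 60 mV for every tenfold reduction in current, so a 0.2V drop is plausible for a diode carrying only a trickle of RTC current. A real Schottky diode has series resistance and a different ideality factor, so the exact number will differ.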

This 2.80V is remarkably close to the 2.78V measured across the battery terminals, enough so that I am satisfied.

The real-time clock (RTC) is powered and running. Then why does only an RTC battery enable a clusterboard SOPINE wakeup, but one is unnecessary in the baseboard? We are seemingly back at square one.

Is the VCC-RTC line dirty? Does the RTC ever lose power, however briefly?

A voltmeter and an oscilloscope with a microsecond range (by default) show an instantaneous 2.78V. Let’s make the scope more sensitive and stretch the time range to see if there are any dropouts.

Clusterboard VCC-RTC voltage dropouts

Yes! The clusterboard experiences small voltage dropouts when observed over a period of several seconds. From the image above, the RTC voltage drops from 2.78V to about 2V briefly. Here is a video of this phenomenon.
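
If the scope can export its samples, spotting these dropouts can be automated instead of watching the screen. The sketch below is a hypothetical helper: the sample values are made up to resemble the trace above, and the 2.5V threshold is an arbitrary choice, not a figure from any datasheet.

```python
def find_dropouts(samples, threshold_v):
    """Return (start_index, end_index, min_voltage) for each run below threshold_v."""
    dropouts = []
    start = None
    for i, volts in enumerate(samples):
        if volts < threshold_v and start is None:
            start = i  # a dropout begins
        elif volts >= threshold_v and start is not None:
            dropouts.append((start, i - 1, min(samples[start:i])))
            start = None  # the dropout ended on the previous sample
    if start is not None:
        # The trace ended while still below the threshold.
        dropouts.append((start, len(samples) - 1, min(samples[start:])))
    return dropouts


# Made-up samples for illustration: nominal 2.78V with two brief dips.
trace = [2.78, 2.78, 2.40, 2.00, 2.60, 2.78, 2.78, 2.10, 2.78]
events = find_dropouts(trace, threshold_v=2.5)
for start, end, lowest in events:
    print(f"dropout from sample {start} to {end}, minimum {lowest:.2f}V")
```

Logging the index of each event makes it easier to line the dropouts up against other activity on the board.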

What causes these RTC voltage dropouts?

These voltage dropouts only happen when on 5V battery power. When the clusterboard is powered by the 15A brick adapter, no dropouts occur. I suspect a clusterboard with a single SOPINE module draws more than 1A – the limit of my USB batteries.

Can I try something random that makes little sense but might just work?

Why not give the clusterboard VCC-RTC the full 3.00V without 2xAA batteries? Shot in the dark, with only one SOPINE on the clusterboard for obvious safety, let’s short the OD4 diode on the SOPINE module (to reclaim that 0.2V drop) and observe the UART line to see what the SoC does if anything.

Short the OD4 Schottky diode to gain 0.2V

The SOPINE in the clusterboard sometimes resets! Without 2xAA batteries (and yes without shorting VCC to GND), a reset is triggered, but only sometimes.

Could it be possible there is just not enough current reaching the RTC? When on USB 5V batteries, the VCC-RTC did experience voltage dropouts seemingly around when the RTL8211E chips were blinking. Raising the VCC-RTC line to 3.00V may have taken the line right on the cusp of supplying sufficient current to fully operate the RTC.

Let’s trace the current differences between the baseboard and clusterboard.

The AXP803 PMIC spec shows the RTCLDO (VCC_RTC) supplies 60mA typical. There is a 10kΩ pull-up resistor drawing 2.8V/10kΩ or 0.28mA ≪ 60 mA, so current isn’t lost there. How about that PWROK pin? We learned earlier it is tied to RTCLDO. Let’s see where its current goes.

Possible RTCLDO current through the reset system

The worst-case current seems to be if RESET is tied to GND which becomes 3.0V/1kΩ or 3mA ≪ 60 mA. However, there is a huge difference between the baseboard and the clusterboard: the baseboard has no reset mechanism, while the clusterboard has NOT-gates which constantly draw current.
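
The two estimates so far are plain Ohm’s law, and a short sketch makes it easy to compare them against the RTCLDO budget. The voltages, resistances, and the 60 mA typical figure are the ones quoted above; the helper names are my own.

```python
RTCLDO_TYPICAL_A = 60e-3  # typical RTCLDO output quoted from the AXP803 spec


def current_a(volts, ohms):
    """Ohm's law: I = V / R."""
    return volts / ohms


def fraction_of_budget(amps, budget_a=RTCLDO_TYPICAL_A):
    """Share of the RTCLDO's typical output that a load consumes."""
    return amps / budget_a


pull_up = current_a(2.8, 10e3)      # 10 kOhm pull-up at 2.8V
reset_worst = current_a(3.0, 1e3)   # RESET tied to GND through 1 kOhm

for name, amps in (("10k pull-up", pull_up), ("RESET tied to GND", reset_worst)):
    share = fraction_of_budget(amps)
    print(f"{name}: {amps * 1e3:.2f} mA, {share:.1%} of the 60 mA budget")
```

Both loads together stay far below the budget, so any serious drain has to come from somewhere else.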

Possible RTCLDO current drain through reset system

How much current? Let’s turn to the 74LVT04 logic IC specs.

74LVT04 current and voltage specs

The -32mA stands out for the high output which is the default state. However, two NOT-gates are used per SOPINE, so I’m unsure how to calculate the negative current draw on RESET. It’s possible there is enough EMF to draw a higher current on the RESET line through the PWROK pin and finally through the VCC_RTC which meets or exceeds the 60mA (and 100mA max) of the PMIC.

Can we simulate the reset circuit?

After two days of effort with Multisim and OrCAD, the experience is too painful. The NXP line of BiCMOS 74LVTxx components is not found in either simulation suite, and efforts to download and/or import the model files from 1999 have left me shaking my head.

OrCAD 74LVT04 import failure

I’m hoping my StackExchange question can help. I got close with OrCAD, but not close enough. The problem with using digital parts in Multisim is the internal current and resistances are not simulated, which is what we crucially need.

In the meantime, let’s resist the temptation to slice this delicious wire separating the SOPINE RTC from the clusterboard, even though doing so would definitely answer the question of whether the non-reset culprit is, ironically, the reset distributor.

Clusterboard RESET_A line from 74LVT04

Let’s model the 74LVT04 IC electrically and simulate the reset circuit.

StackExchange came through. We can now simulate the electrical characteristics of the 74LVT04 hex inverter.

However, when I attempt to recreate the SOPINE schematic around the RTCLDO, the clusterboard electrical values don’t match real measurements, plus I see high-frequency ringing and pulses of high voltage.

SOPINE schematic with clusterboard reset

When I read the AXP803 manual carefully, I realize the PWROK is a push-pull line pulling to VCC_RTC internally. What is this? Research shows it is a GPIO line flanked by two transistors (MOSFETs?), one connected to GND, and the other to VCC_RTC. Let’s try to simulate that with ideal MOSFETs.

AXP803 PWROK push-pull simulation

By putting a probe between R16 and C79 on the SOPINE, we can measure the PWROK line voltage. There is a difference between the baseboard and the clusterboard:

  • Baseboard PWROK: 3.00V
  • Clusterboard PWROK: 3.24V
  • Clusterboard R358: -2mV
  • Clusterboard RESET_A, SOPINE inserted: 3.29V
  • Clusterboard RESET_A, no SOPINE: 3.29V

SOPINE PWROK voltage probe point

Even though the push-pull MOSFETs are not correct, we can simulate the 74LVT04 feeding 3.3V into the PWROK line which is supposed to be a max of 3.00V. This is as far as we can simulate, however. We can see how complex this is becoming, and how many unknown components there are to guess.

SOPINE power simulation is now too complex

The PMIC is a reactive component with an LDO and unknown push-pull circuit. Can we empirically discover more electrical differences instead?

Simulations aren’t panning out. We don’t have any information on the design of the push-pull and how it reacts to back-EMF from the 74LVT04. We just know 3.3V of back-EMF prevents a WDT reset. Let’s try something else. Now that we know how to measure the PWROK line (and RESET input), let’s graph the voltage around resets.

PWROK clusterboard voltage during reset. Left: no RTC battery. Right: Normal and RTC battery

Clusterboard PWROK voltages after WDT reset

Wow. On the clusterboard during a WDT reset, the PWROK line drops from 3.24V to 2.25V and never recovers unless a hardware reset is performed.

This is a wonderful discovery. It means we are on the right track. The left waveform is self-explanatory. The right waveform is more interesting. Here are the observations (15A power, one SOPINE, clusterboard):

Left PWROK waveform:

  • No RTC battery, WDT reset fires after 2s.

Right PWROK waveform:

  • No RTC battery, no WDT reset.
  • RTC battery, no WDT reset.
  • RTC battery, WDT reset fires after 2s.

The right waveform is identical for the above three observations as well. Only the left waveform is unusual for normal operations.

What constitutes a HIGH signal in the A64?

Turning to the electrical guide of the A64, we see that VCC-IO can range from 3.0V to 3.6V.

A64 VCC-IO limits

In the SOPINE schematic, VCC-IO is supplied by DCDC1 from the PMIC, which is regulated at 3.3V. Then, on page 34 of the datasheet, we see the following table.

A64 DC electrical characteristics table

Then, 0.7 * 3.3V = 2.31V. Thus, the minimum threshold for a HIGH signal (no reset) on the A64 RESET# line from PWROK is 2.31V, yet PWROK drops to 2.25V – still HIGH?

But is it LOW? The upper threshold for a LOW signal is 0.3 * 3.3V = 0.99V. What does inverting-logic RESET do when the signal is between HIGH and LOW? The datasheet indicates the line is not pulled up or down. Normally the region between logic thresholds is undefined, but RESET is inverted. Here is a possible logic inverter with a single Darlington pair (I chose values for saturation at 2.3V to show the effect clearly).

Possible RESET line buffer/inverter in the A64 at 2.3V saturation
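
The threshold arithmetic above can be written as a small classifier. This sketch uses only the 0.7 and 0.3 factors and the 3.3V VCC-IO discussed above; the function names and the "UNDEFINED" label are my own, and the code says nothing about what the A64 actually does inside that middle band.

```python
VCC_IO = 3.3  # DCDC1 from the PMIC, per the SOPINE schematic


def logic_thresholds(vcc):
    """Return (max voltage guaranteed LOW, min voltage guaranteed HIGH)."""
    return 0.3 * vcc, 0.7 * vcc


def classify(volts, vcc=VCC_IO):
    """Classify an input voltage against the datasheet thresholds."""
    v_il_max, v_ih_min = logic_thresholds(vcc)
    if volts >= v_ih_min:
        return "HIGH"
    if volts <= v_il_max:
        return "LOW"
    return "UNDEFINED"  # between the thresholds, behavior is not guaranteed


for label, volts in (("PWROK before WDT reset", 3.24), ("PWROK after WDT reset", 2.25)):
    print(f"{label}: {volts:.2f}V is {classify(volts)}")
```

The 2.25V level lands just under the 2.31V HIGH threshold and well above the 0.99V LOW threshold, which is exactly the band where the datasheet makes no promise.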

Among friends, let’s say that under 2.3V a RESET will occur. Now we can move on to how to solve this problem without 2xAA batteries by preventing the voltage drop at all.

The RTC is not the problem. Let’s focus on NMI and RESET from the PMIC and the back-EMF of 3.3V.

How to solve this back-EMF from the clusterboard reset distributor without resorting to SMD re-soldering (or batteries)? Below is a fun animation I made while I wait for some custom testing hardware to arrive.

Clusterboard reset distributor PCB layers (source)

This is still a work in progress to understand why the batteries allow the clusterboard SOPINEs to individually restart.

Notes:

  1. The ‘#’ denotes assertion or activation when the line goes low.