Hey, I’ve been experimenting with the 4x N300s service in the North America region, but when I redeployed and tried to mesh with tt-topology again, I got the error shown in the log below.
It seems to be an issue with corrupted memory in one of the chips. How is this possible?
Additional info:
Log text:
Working in temporary directory: /tmp/tmp.BAaMvLMcxM
---
Creating Python virtual environment...
Virtual environment activated.
---
Installing tt-topology from git...
[notice] A new release of pip is available: 23.0.1 -> 26.1
[notice] To update, run: pip install --upgrade pip
tt-topology installed.
---
Running tt-topology command. This may take a moment...
Detected Chips: 4
Traceback (most recent call last):
File "/tmp/tmp.BAaMvLMcxM/tt-topology-venv/lib/python3.10/site-packages/tt_topology/tt_topology.py", line 560, in main
run_and_flash(topo_backend)
File "/tmp/tmp.BAaMvLMcxM/tt-topology-venv/lib/python3.10/site-packages/tt_topology/tt_topology.py", line 131, in run_and_flash
topo_backend.get_eth_config_state()
File "/tmp/tmp.BAaMvLMcxM/tt-topology-venv/lib/python3.10/site-packages/tt_topology/backend.py", line 250, in get_eth_config_state
data["fw_version"] == config_state[0]["fw_version"]
AssertionError: Firmware versions do not match: 0x6072000 != 0xffffffff
Saved json log file to /root/tt_topology_logs/05-04-2026_11:38:09_log.json
!!! ERROR: Failed to configure mesh topology
---
Cleaning up...
Removing temporary directory: /tmp/tmp.BAaMvLMcxM
Cleanup complete.
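To show why I suspect a failed read rather than a real firmware mismatch, here is a minimal sketch of the kind of check that tripped. It is not the actual tt-topology code. The helper name classify_fw_versions and the four-chip readout are hypothetical; only the two values 0x6072000 and 0xffffffff come from the log, and which chip returned the all-ones value is my assumption. An all-ones 32-bit value is what a register read commonly returns when the device does not respond, so I treat it as "unreadable" instead of as a version.

```python
# A 32-bit read that comes back as all ones usually means the read itself
# failed, not that the firmware really reports this version (assumption).
ALL_ONES_32 = 0xFFFFFFFF


def classify_fw_versions(versions):
    """Split per-chip firmware versions into unreadable and mismatched.

    Returns (unreadable, mismatched): two lists of chip indices.
    Mismatches are judged against the first readable version.
    """
    unreadable = [i for i, v in enumerate(versions) if v == ALL_ONES_32]
    readable = [(i, v) for i, v in enumerate(versions) if v != ALL_ONES_32]
    mismatched = [i for i, v in readable if v != readable[0][1]]
    return unreadable, mismatched


# Hypothetical readout: three healthy chips and one that returns all ones.
readout = [0x6072000, 0x6072000, 0x6072000, 0xFFFFFFFF]
unreadable, mismatched = classify_fw_versions(readout)
print("unreadable chips:", unreadable)
print("mismatched chips:", mismatched)
```

If that reading is right, the three readable chips agree with each other and the "mismatch" is just one chip that could not be read at all.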
The full command I used (from here):
TMP_DIR=$(mktemp -d); (trap 'echo "---"; echo "Cleaning up..."; if type deactivate &>/dev/null; then deactivate; fi; echo "Removing temporary directory: $TMP_DIR"; rm -rf "$TMP_DIR"; cd; echo "Cleanup complete."' EXIT; trap 'echo -e "\033[0;31m!!! ERROR: Failed to configure mesh topology\033[0m"' ERR; set -e; cd "$TMP_DIR"; echo "Working in temporary directory: $TMP_DIR"; echo "---"; echo "Creating Python virtual environment..."; python3 -m venv tt-topology-venv; source tt-topology-venv/bin/activate; echo "Virtual environment activated."; echo "---"; echo "Installing tt-topology from git..."; pip install --quiet git+https://github.com/tenstorrent/tt-topology.git; echo "tt-topology installed."; echo "---"; echo "Running tt-topology command. This may take a moment..."; tt-topology -l mesh; echo "---"; echo "Script finished successfully.";)
But the important part is tt-topology -l mesh.
More info: I tried a reset but no luck:
root@36103f9f:~# sudo tt-smi -r 0 1 2 3
Starting reset on devices at PCI indices: 0, 1, 2, 3
Waiting for 2 seconds for potential hotplug removal.
Waiting for devices to reappear on pci bus...
Reset successfully completed for device at PCI index 0.
Waiting for devices to reappear on pci bus...
Reset successfully completed for device at PCI index 1.
Waiting for devices to reappear on pci bus...
Reset successfully completed for device at PCI index 2.
Waiting for devices to reappear on pci bus...
Reset successfully completed for device at PCI index 3.
Finishing reset on devices at PCI indices: 0, 1, 2, 3
Re-initializing boards after reset....
Detected Chips: 1
Detecting ARC: -
Detecting DRAM: -
[] [0/16] ETH: -
Error when re-initializing chips!
Chip initialization failed:
Communication Status: Success
DRAM Status: Timeout, 4 out of 4 initialized
CPU Status: Timeout
ARC Status: Timeout, 1 out of 1 initialized
Ethernet Status: Timeout, 0 out of 16 initialized
Noc Safe: false
Unknown State: false
0: luwen_api::chip::init::wait_for_init
1: luwen_api::detect_chips::detect_chips
2: pyluwen::detect_chips_fallible
3: pyluwen::_::__pyfunction_detect_chips_fallible
4: pyo3::impl_::trampoline::trampoline
5: pyluwen::_::<impl pyluwen::detect_chips_fallible::MakeDef>::DEF::trampoline
6: cfunction_vectorcall_FASTCALL_KEYWORDS
at /tmp/Python-3.10.15/Objects/methodobject.c:446:24
7: _PyObject_VectorcallTstate
at /tmp/Python-3.10.15/./Include/cpython/abstract.h:114:11
8: PyObject_Vectorcall
at /tmp/Python-3.10.15/./Include/cpython/abstract.h:123:12
9: call_function
at /tmp/Python-3.10.15/Python/ceval.c:5893:13
10: _PyEval_EvalFrameDefault
at /tmp/Python-3.10.15/Python/ceval.c:4231:19
11: _PyEval_EvalFrame
at /tmp/Python-3.10.15/./Include/internal/pycore_ceval.h:46:12
12: _PyEval_Vector
at /tmp/Python-3.10.15/Python/ceval.c:5067:24
13: _PyObject_VectorcallTstate
at /tmp/Python-3.10.15/./Include/cpython/abstract.h:114:11
14: PyObject_Vectorcall
at /tmp/Python-3.10.15/./Include/cpython/abstract.h:123:12
15: call_function
at /tmp/Python-3.10.15/Python/ceval.c:5893:13
16: _PyEval_EvalFrameDefault
at /tmp/Python-3.10.15/Python/ceval.c:4231:19
17: _PyEval_EvalFrame
at /tmp/Python-3.10.15/./Include/internal/pycore_ceval.h:46:12
18: _PyEval_Vector
at /tmp/Python-3.10.15/Python/ceval.c:5067:24
19: _PyObject_VectorcallTstate
at /tmp/Python-3.10.15/./Include/cpython/abstract.h:114:11
20: PyObject_Vectorcall
at /tmp/Python-3.10.15/./Include/cpython/abstract.h:123:12
21: call_function
at /tmp/Python-3.10.15/Python/ceval.c:5893:13
22: _PyEval_EvalFrameDefault
at /tmp/Python-3.10.15/Python/ceval.c:4231:19
23: _PyEval_EvalFrame
at /tmp/Python-3.10.15/./Include/internal/pycore_ceval.h:46:12
24: _PyEval_Vector
at /tmp/Python-3.10.15/Python/ceval.c:5067:24
25: _PyObject_VectorcallTstate
at /tmp/Python-3.10.15/./Include/cpython/abstract.h:114:11
26: PyObject_Vectorcall
at /tmp/Python-3.10.15/./Include/cpython/abstract.h:123:12
27: call_function
at /tmp/Python-3.10.15/Python/ceval.c:5893:13
28: _PyEval_EvalFrameDefault
at /tmp/Python-3.10.15/Python/ceval.c:4181:23
29: _PyEval_EvalFrame
at /tmp/Python-3.10.15/./Include/internal/pycore_ceval.h:46:12
30: _PyEval_Vector
at /tmp/Python-3.10.15/Python/ceval.c:5067:24
31: PyEval_EvalCode
at /tmp/Python-3.10.15/Python/ceval.c:1134:12
32: <unknown>
33: <unknown>
34: <unknown>
35: __libc_start_main
36: <unknown>
So the devices are detected at the PCI level and the reset completes successfully for all four PCI indices, but after the reset only 1 chip is detected, and its initialization times out with 0 out of 16 Ethernet links initialized.
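To make the "Chip initialization failed" block easier to compare between redeploys, I put together a small parsing sketch. The line format is assumed from this one log only, and parse_init_status and incomplete_components are hypothetical helper names, not part of tt-smi or luwen.

```python
import re

# Matches lines such as "DRAM Status: Timeout, 4 out of 4 initialized"
# and "CPU Status: Timeout" (format assumed from the log above).
STATUS_LINE = re.compile(
    r"^\s*(?P<name>[A-Za-z ]+?) Status:\s*(?P<state>\w+)"
    r"(?:,\s*(?P<done>\d+) out of (?P<total>\d+) initialized)?\s*$"
)


def parse_init_status(text):
    """Return {component: {"state", "initialized", "total"}} from log text."""
    statuses = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line)
        if not match:
            continue
        done = match.group("done")
        statuses[match.group("name")] = {
            "state": match.group("state"),
            "initialized": int(done) if done is not None else None,
            "total": int(match.group("total")) if done is not None else None,
        }
    return statuses


def incomplete_components(statuses):
    """List components that report fewer initialized units than the total."""
    return [
        name
        for name, info in statuses.items()
        if info["total"] is not None and info["initialized"] < info["total"]
    ]


# The status lines copied from the reset output above.
LOG = """Communication Status: Success
DRAM Status: Timeout, 4 out of 4 initialized
CPU Status: Timeout
ARC Status: Timeout, 1 out of 1 initialized
Ethernet Status: Timeout, 0 out of 16 initialized"""

statuses = parse_init_status(LOG)
print(incomplete_components(statuses))
```

On this log, Ethernet is the only component whose initialized count is below its total, even though DRAM and ARC are also marked as Timeout.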
Is a full power cycle or a firmware reflash possible? It is very annoying to have to redeploy my service about 20 times in the hope that I don’t land on this one broken Loudbox.