Schedules, Gateway, and the Event Loop
54 commits in 9 days: scheduled agents, Telegram/WhatsApp access, MCP servers, and a bug that killed the server four times.
February 25 - March 5, 2026
- 54 commits shipped
- 4 schedule rewrites
- 2 new messaging channels
- Version 2.2.0
February 25, 2026. The day after we shipped iOS support and cloud mode. The plan was stabilization. Bug fixes. Catching our breath after 74 commits in a week.
That lasted about six hours.
The scheduling saga
We’d built scheduled agents months ago. The concept is simple: tell an agent to run at a specific time, and it runs. A feed creation agent that checks sources every morning. A monitoring agent that scans for changes every hour. The kind of automation that makes AI agents actually useful instead of just impressive.
The implementation had been anything but simple.
The original architecture stored schedule definitions in two places — the VFS (virtual filesystem) and MongoDB. This made sense when VFS was the primary storage layer, but it meant every schedule operation had to sync between two databases. On March 2, we ripped that out. Direct MongoDB CRUD, no dual-storage. One source of truth.
Then we tried to actually run them.
Four crashes in one day
March 4 was the day schedules almost won.
Attempt 1: Enable schedule execution. Server freezes within seconds. The event loop is completely blocked. HTTP health checks fail. Cloudflare marks the instance as down. We disable it.
Attempt 2: Execute schedules sequentially instead of in parallel. The theory was sound — if parallel agent executions starve the event loop, run them one at a time. Server freezes anyway. The problem wasn’t parallelism.
Attempt 3: Re-enable with comprehensive production tracing. Wrap every step in performance markers. Log entry and exit of every function in the execution path. Server freezes again, but this time we can see where.
Attempt 4: Disable again. Read the traces. Find it.
The bug was in calculateNextExecution. When a schedule’s cron expression resolved to a next-run time in the past — which happens when the server was down during the scheduled window — the function looped forward to find the next valid time. But the loop didn’t yield. No await, no setImmediate, no way for the event loop to breathe. A single schedule with a missed window would spin the CPU until the process died.
The fix was two lines. Yield on every iteration. Re-enable schedules on March 5. They’ve been running since.
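The shape of the bug and the fix can be sketched like this. This is a simplified stand-in, assuming a fixed-interval schedule rather than LikeClaw's real cron parsing, and the function name aside, none of the signatures here are the production code:

```typescript
// Sketch: advancing a missed schedule to the next future run without
// blocking the event loop. Assumes a simple fixed-interval schedule;
// the real calculateNextExecution resolves cron expressions.
async function calculateNextExecution(
  lastRun: number,    // epoch ms of the last scheduled run
  intervalMs: number, // schedule period
  now: number = Date.now(),
): Promise<number> {
  let next = lastRun + intervalMs;
  // If the server was down during the window, `next` is in the past.
  // Loop forward to the next valid time -- but yield on every
  // iteration so the event loop can keep serving HTTP health checks.
  while (next <= now) {
    next += intervalMs;
    // The two-line fix: give the event loop a turn each iteration.
    await new Promise<void>((resolve) => setImmediate(resolve));
  }
  return next;
}
```

Without the `await`, this loop runs synchronously: nothing else on the event loop executes until it returns, which is exactly the freeze the traces showed.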
Four disables, three re-enables, one infinite loop. That’s production.
Gateway: Telegram and WhatsApp
While schedules were crashing servers, the Gateway module shipped quietly on February 25.
The idea: your LikeClaw agents should be reachable from wherever you already are. Not everyone wants to open a dashboard. Some people live in Telegram. Some teams coordinate on WhatsApp. The agent shouldn’t care how you reach it.
Gateway is a single NestJS module that handles both platforms. You configure a Telegram bot token or a WhatsApp number, and incoming messages route to your workspace’s agent. The agent sees a normal chat session. It doesn’t know the message came from Telegram. It doesn’t need to.
On February 26, we added QR codes and bot info to the Gateway UI. Scan with your phone, send a message, get an agent response. The configuration is one screen. The complexity is in the routing layer, where it belongs.
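The routing-layer idea is straightforward to sketch: each platform adapter normalizes its webhook payload into one channel-agnostic message shape before anything agent-facing sees it. All of the names and shapes below are illustrative, not LikeClaw's actual API:

```typescript
// Sketch of Gateway's routing idea: platform adapters normalize
// inbound payloads so the agent only ever sees a plain chat message.
interface InboundMessage {
  workspaceId: string;
  senderId: string;
  text: string;
}

// Hypothetical, heavily simplified Telegram webhook payload.
interface TelegramUpdate {
  message: { chat: { id: number }; text: string };
}

function fromTelegram(update: TelegramUpdate, workspaceId: string): InboundMessage {
  return {
    workspaceId,
    senderId: String(update.message.chat.id),
    text: update.message.text,
  };
}

// The agent-facing entry point doesn't know, or care, which
// platform produced the message.
function routeToAgent(msg: InboundMessage): string {
  return `[workspace ${msg.workspaceId}] agent received: ${msg.text}`;
}
```

A WhatsApp adapter would produce the same `InboundMessage`, which is why adding a platform doesn't touch the agent code.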
MCP servers go standalone
Model Context Protocol servers have been part of LikeClaw since we launched the skills marketplace. But they were always second-class citizens — attached to skills, never existing on their own.
On March 4, MCP servers became first-class entities. Register a server directly. Give it a name and an endpoint. Reference it in the chat input with # autocomplete, the same way you reference skills. The agent picks up the server’s tools automatically.
This matters because the MCP ecosystem is growing fast. People run local MCP servers for filesystem access, database queries, custom APIs. Requiring them to wrap each one in a skill was friction we didn’t need to impose.
The # autocomplete landed the same day. Type # in the chat input and you get a dropdown of available skills and MCP servers. Select one, and it’s included in the agent’s context for that session. Small UX change, big workflow improvement.
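Conceptually, the change puts skills and MCP servers into one referencable pool that the `#` dropdown filters. A minimal sketch, with illustrative types that aren't LikeClaw's real data model:

```typescript
// Sketch of the # autocomplete lookup: skills and standalone MCP
// servers live in one pool, filtered by the text typed after "#".
type Referencable =
  | { kind: "skill"; name: string }
  | { kind: "mcp-server"; name: string; endpoint: string };

function autocomplete(pool: Referencable[], query: string): Referencable[] {
  const q = query.toLowerCase();
  return pool.filter((r) => r.name.toLowerCase().startsWith(q));
}
```

Selecting an entry of either kind adds it to the session context; for an MCP server, the agent then pulls in that server's tools.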
The billing grind
Apple deprecated /verifyReceipt. We’d known it was coming but hadn’t prioritized the migration until March 3, when debug logs revealed intermittent verification failures.
The migration to StoreKit 2 JWS took two days. The old flow: user purchases credits, app sends the receipt to our server, server forwards it to Apple’s verification endpoint, Apple responds with the decoded receipt. Three network hops.
The new flow: user purchases credits, app sends the JWS transaction, server verifies the cryptographic signature locally. One network hop. Apple signs the transaction on the device. We verify the signature with Apple’s public key. No more calling Apple’s server for every purchase.
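The core idea can be shown with a toy ES256 round trip. Here the "device" signs with a locally generated key pair standing in for Apple; the real flow additionally validates Apple's x5c certificate chain, so treat this strictly as a sketch of local-verification-instead-of-a-network-call:

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Toy JWS round trip: signer produces header.payload.signature,
// verifier checks the signature locally with the public key.
const { publicKey, privateKey } = generateKeyPairSync("ec", {
  namedCurve: "prime256v1", // P-256, the curve JWS ES256 uses
});

function signTransaction(payload: object): string {
  const header = Buffer.from(JSON.stringify({ alg: "ES256" })).toString("base64url");
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  const signingInput = `${header}.${body}`;
  const sig = sign("sha256", Buffer.from(signingInput), {
    key: privateKey,
    dsaEncoding: "ieee-p1363", // JWS uses the raw (r || s) signature form
  }).toString("base64url");
  return `${signingInput}.${sig}`;
}

function verifyTransaction(jws: string): object | null {
  const [header, body, sig] = jws.split(".");
  const ok = verify(
    "sha256",
    Buffer.from(`${header}.${body}`),
    { key: publicKey, dsaEncoding: "ieee-p1363" },
    Buffer.from(sig, "base64url"),
  );
  return ok ? JSON.parse(Buffer.from(body, "base64url").toString()) : null;
}
```

A tampered payload fails verification, which is the property that lets the server trust the transaction without asking Apple.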
While we were in the billing code, we fixed a subtle bug: duplicate in-app purchases were silently returning the old payment record instead of throwing an error. Now they return a 409 Conflict. The mobile app handles it by showing the existing purchase. Clean.
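The behavioral change, stripped of the NestJS plumbing, looks something like this (names and the in-memory store are illustrative):

```typescript
// Sketch of the duplicate-purchase fix: a repeat of the same
// transaction now surfaces as a 409 instead of silently returning
// the old payment record.
class ConflictError extends Error {
  readonly status = 409;
}

const purchases = new Map<string, { transactionId: string; credits: number }>();

function recordPurchase(transactionId: string, credits: number) {
  const existing = purchases.get(transactionId);
  if (existing) {
    // Old behavior: `return existing;` -- the silent bug.
    throw new ConflictError(`purchase ${transactionId} already recorded`);
  }
  const record = { transactionId, credits };
  purchases.set(transactionId, record);
  return record;
}
```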
We also started storing localized prices on payment records. When a user in Brazil buys credits, the record now shows the price in BRL, not USD. Small thing. Matters for purchase history.
External agents and the in-process LLM client
Two infrastructure changes that don’t have flashy UIs but changed how the system works.
External agent configuration. We built CRUD endpoints and an invoke API for connecting external chat systems to LikeClaw agents. An external service can register an agent config, send messages to the invoke endpoint, and get agent responses back. The dashboard got an agent selector dropdown for managing these configs. This is the foundation for white-label integrations — other products embedding LikeClaw agents without building the agent infrastructure themselves.
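The surface area is small: a config store plus an invoke call. A minimal sketch with hypothetical names, where the stub reply stands in for a full agent session:

```typescript
// Sketch of the external-agent surface: register a config, then send
// messages to an invoke entry point and get the agent's reply back.
interface ExternalAgentConfig {
  id: string;
  name: string;
  workspaceId: string;
}

const configs = new Map<string, ExternalAgentConfig>();

function registerConfig(config: ExternalAgentConfig): void {
  configs.set(config.id, config);
}

// Stand-in for the real invoke endpoint, which runs an LLM session.
function invoke(configId: string, message: string): string {
  const config = configs.get(configId);
  if (!config) throw new Error(`unknown external agent config: ${configId}`);
  return `[${config.name}] reply to: ${message}`;
}
```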
In-process LLM client. Our backend had been calling its own API through Cloudflare to route LLM requests. Self-calling through a reverse proxy. It sounds wrong because it is wrong. Cloudflare intermittently returned 405 errors on these self-calls, especially under load. The fix: an in-process LLM client that bypasses the network entirely for internal requests. Same routing logic, zero network overhead, no more 405s.
We also made OpenAI API clients static singletons. Before this change, every agent session created a new HTTP client. Under load, the old clients weren’t being garbage collected fast enough, leading to CLOSE_WAIT connection leaks. Static singletons mean one client per provider, reused across all sessions.
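The singleton pattern here is simple to sketch. `FakeLlmClient` stands in for the real provider client; the point is one lazily created instance per provider, shared by every session:

```typescript
// Sketch of the static-singleton change: one client per provider,
// reused across all agent sessions, instead of a fresh client per
// session (which leaked CLOSE_WAIT sockets under load).
class FakeLlmClient {
  constructor(readonly provider: string) {}
}

class LlmClients {
  private static readonly clients = new Map<string, FakeLlmClient>();

  static for(provider: string): FakeLlmClient {
    let client = LlmClients.clients.get(provider);
    if (!client) {
      client = new FakeLlmClient(provider);
      LlmClients.clients.set(provider, client);
    }
    return client;
  }
}
```

Two sessions asking for the same provider now get the same client object, so connection pools are shared rather than multiplied.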
Profile, memory, and the small things
March 4 also shipped a profile and memory settings page. Users can now view and edit their stored preferences and agent memory from the dashboard. Previously, the only way to update memory was through the agent itself — tell it to remember something and hope it persists correctly. Now there’s a screen.
The feed system got source URLs. When an agent creates a feed item from a web source, the item now carries the original links. Users can trace back to where the information came from. Attribution that should’ve been there from the start.
What the numbers say
54 commits in 9 days. Version 2.2.0 shipped. Two new messaging channels. A scheduling system that went from crashing to stable. StoreKit 2 migration complete. MCP servers promoted to first-class. External agent APIs live.
The pattern from the last few weeks is holding. Ship fast, break things, fix them the same day. The scheduling saga was the most dramatic — four crashes and a fix in a single day — but it’s also the most representative. We don’t defer problems. We don’t put up “coming soon” pages. We enable the feature, watch it break, and stay online until it doesn’t.
Next up: the schedule system needs a UI for creating and managing schedules from the dashboard. Right now it’s agent-driven only. And Gateway needs to support more platforms beyond Telegram and WhatsApp. The plumbing is in place. Now we build on top of it.
Schedules before this week
- Dual-storage architecture with VFS and MongoDB
- Parallel execution starving the event loop
- Server crashes within seconds of enabling
- No messaging integration for mobile users
Schedules after this week
- Direct MongoDB CRUD, no VFS layer
- Sequential execution that yields to the event loop while catching up
- Stable production execution with tracing
- Telegram and WhatsApp as first-class entry points
Questions about this sprint
Why did schedules crash the server four times?
The agent execution model wasn't designed for concurrent background work. Each scheduled agent spins up an LLM context, tool registry, and session state. Running them in parallel from a setInterval callback starved the Node.js event loop — HTTP requests couldn't get through. We tried sequential execution, then tracing, then disabling entirely, before finding the real bug: an infinite loop in calculateNextExecution that never yielded.
What's the Gateway module?
It's a single endpoint that connects Telegram bots and WhatsApp numbers to your LikeClaw workspace. You configure a bot, scan a QR code, and messages route to your agents. Think of it as a bridge — your agents don't know they're talking to Telegram. They just see a chat session.
Why add MCP servers as standalone entities?
Before this change, MCP servers were always attached to a skill. If you wanted to use a standalone MCP server — say, a local filesystem server or a custom API — you had to wrap it in a skill first. That's unnecessary friction. Now you can register MCP servers directly and reference them with # autocomplete in the chat input, same as skills.
What changed with Apple billing?
Apple deprecated the /verifyReceipt endpoint that we'd been using since launch. We migrated to StoreKit 2 with JWS (JSON Web Signature) verification. The new system is cryptographically signed — we verify the signature locally instead of calling Apple's server on every receipt. Faster and more reliable.