CTYun Logpush Backfill Guide

This guide describes the lightweight backfill design. It replays historical R2 logs inside a precise time window, drops Worker subrequests, transforms each record to the CTYun partner format, and sends batches to the customer endpoint under explicit rate control.

Customer Constraints

Three customer-side requirements drive the entire design:

1. Receiver throughput cap (~100,000 lines/s)

The customer endpoint cannot absorb arbitrary throughput, so the Sender is hard-capped at:

```
max_concurrency × (1000 ms / MIN_SENDER_INVOCATION_MS) × BATCH_SIZE
= 20 × (1000 / 200) × 1000
= 100,000 lines/s
```

This is a theoretical ceiling, assuming each batch fills BATCH_SIZE = 1000 lines and customer ACK latency is below 200ms. Actual sustained throughput is also bounded by Parser upstream supply — see Throughput & Tuning below.
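
As a quick sanity check, the same arithmetic in code. The constant names mirror this guide's variables; treat the snippet as illustrative, not the Worker's actual source:

```ts
// Illustrative recomputation of the Sender's theoretical ceiling.
// Constant names follow this guide; they are assumptions, not verified source.
const MAX_CONCURRENCY = 20;           // Sender queue consumers running in parallel
const MIN_SENDER_INVOCATION_MS = 200; // enforced floor per Sender invocation
const BATCH_SIZE = 1000;              // lines per POST

const invocationsPerSec = 1000 / MIN_SENDER_INVOCATION_MS;           // 5
const ceilingLinesPerSec = MAX_CONCURRENCY * invocationsPerSec * BATCH_SIZE;
console.log(ceilingLinesPerSec);      // 100000
```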

2. Receiver does NOT dedupe

Delivery is at-least-once, not exactly-once. The receiver cannot deduplicate, so the Worker minimizes duplicates with two on-R2 markers (.done and .queued). These cover the common cases (Parser retry, Queue redelivery), but they do not eliminate duplicates entirely: a residual duplicate rate remains in rare failure windows. Each batch carries a stable batch identity in its R2 key. Receivers that can dedupe should key on it; receivers that cannot must accept the residual duplicate rate as a known risk.
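
A minimal sketch of the marker check, assuming an R2 binding and the .done/.queued key convention described above (the helper name and exact key layout are illustrative):

```ts
// Skip a batch when either on-R2 marker already exists.
// Assumes `bucket` is the Worker's R2 binding; key layout is illustrative.
async function shouldEmitBatch(bucket: R2Bucket, batchKey: string): Promise<boolean> {
  // .done: the Sender already delivered this batch (covers Queue redelivery).
  if (await bucket.head(`${batchKey}.done`)) return false;
  // .queued: the Parser already enqueued it (covers Parser retry).
  if (await bucket.head(`${batchKey}.queued`)) return false;
  return true;
}
```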

3. Drop Worker subrequests

The customer is billed/audited per top-level request only. The Parser drops, per record, anything matching either check:

```
ParentRayID !== "00"        // i.e. the record is a subrequest
WorkerSubrequest === true   // explicit subrequest flag
```

Both checks run before the format transform.
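
In TypeScript terms, the predicate looks roughly like this (ParentRayID and WorkerSubrequest are standard Logpush HTTP-request fields; the record type here is a simplification):

```ts
// Keep only top-level requests; drop anything a Worker initiated.
interface LogRecord {
  ParentRayID: string;        // "00" marks a top-level request
  WorkerSubrequest?: boolean; // true on Worker-initiated subrequests
}

function isTopLevelRequest(r: LogRecord): boolean {
  return r.ParentRayID === "00" && r.WorkerSubrequest !== true;
}
```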

Coexistence with Production Worker

The production worker ctyun-logpush sends real-time logs to the same customer endpoint at a similar rate. Backfill is fully isolated at the Cloudflare layer:

| Resource | Production | Backfill | Shared? |
|---|---|---|---|
| Worker isolate / CPU budget | ctyun-logpush | ctyun-logpush-backfill | Independent |
| Queues | parse-queue / send-queue | parse-queue-backfill / send-queue-backfill | Independent |
| R2 bucket | cdn-logs-raw | cdn-logs-raw | Shared (different prefixes) |
| R2 prefix | processed/ | processed-backfill/<run-id>/ | Independent |
| Customer endpoint | shared | shared | Shared (downstream pressure) |

Implication: a stuck backfill (e.g. customer endpoint hangs) does not consume production Worker concurrency or queue slots. Both workers share the receiver's bandwidth, so backfill's explicit rate cap is what protects the receiver.

Architecture

```
R2 logs/
-> parse-queue-backfill
-> Parser  (gzip decode + filter + transform + chunk into 1000-line batches)
-> processed-backfill/<run-id>/{batch}.txt + {batch}.queued
-> send-queue-backfill
-> Sender  (gzip + signed POST + .done marker + cleanup)
-> customer endpoint
```

No Durable Object is used in the hot path. Each raw .gz file is parsed independently and written into final batch files directly. The last batch of each file may be under-filled (fewer than 1000 lines) — this is acceptable.
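
The chunking step itself is simple; a sketch (names illustrative):

```ts
// Split one decoded file's transformed lines into fixed-size batches.
// The final batch may hold fewer than batchSize lines, as noted above.
function chunkIntoBatches(lines: string[], batchSize = 1000): string[][] {
  const batches: string[][] = [];
  for (let i = 0; i < lines.length; i += batchSize) {
    batches.push(lines.slice(i, i + batchSize));
  }
  return batches;
}
```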

What It Guarantees

- Delivery is at-least-once: duplicates are possible (constraint 2), silent loss is not; batches that exhaust retries land in the dead-letter queues rather than disappearing.
- Bounded send rate: the Sender can never exceed the ~100,000 lines/s theoretical ceiling (constraint 1).
- Top-level requests only: subrequest records are filtered out before the format transform (constraint 3).
- Production isolation: backfill shares only the R2 bucket (under a separate prefix) and the customer endpoint with the production worker.

Deploy

| Variable | Default | Meaning |
|---|---|---|
| BACKFILL_START_TIME | "" | Replay start time, ISO 8601 |
| BACKFILL_END_TIME | "" | Replay end time, ISO 8601, must be <= now |
| BACKFILL_ENABLED | "false" | Must be "true" to run |
| BACKFILL_RATE | "60" | Files enqueued per cron minute (max 100) |
| BATCH_SIZE | "1000" | Lines per POST. Do not increase without confirming receiver capacity. |
| SEND_TIMEOUT_MS | "300000" | Max wait for customer ACK. Customer-tested ACK is 1–5 s; the long timeout is intentional, to minimize abort-induced duplicates. |
| LOG_PREFIX | "logs/" | Logpush root prefix in R2 |
| LOG_LEVEL | "info" | debug / info / warn / error |

Maximum replay window: 7 days (MAX_RANGE_HOURS = 24 * 7). Larger windows must be split into multiple runs.
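
A sketch of the validation this implies, assuming ISO 8601 inputs (the Worker's actual error handling may differ):

```ts
// Reject windows that are empty, in the future, or longer than 7 days.
const MAX_RANGE_HOURS = 24 * 7;

function validateWindow(startIso: string, endIso: string): void {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  if (Number.isNaN(start) || Number.isNaN(end)) {
    throw new Error("BACKFILL_START_TIME / BACKFILL_END_TIME must be ISO 8601");
  }
  if (end > Date.now()) throw new Error("BACKFILL_END_TIME must be <= now");
  if (end <= start) throw new Error("window is empty");
  if ((end - start) / 3_600_000 > MAX_RANGE_HOURS) {
    throw new Error("window exceeds 7 days: split into multiple runs");
  }
}
```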

```sh
wrangler queues create parse-queue-backfill
wrangler queues create send-queue-backfill
wrangler queues create parse-dlq-backfill
wrangler queues create send-dlq-backfill

wrangler secret put CTYUN_ENDPOINT --config wrangler-backfill.toml
wrangler secret put CTYUN_PRIVATE_KEY --config wrangler-backfill.toml
wrangler secret put CTYUN_URI_EDGE --config wrangler-backfill.toml

wrangler deploy --config wrangler-backfill.toml
```

Throughput & Tuning

The pipeline has two independent ceilings — the lower one wins. The numbers below come from a real customer run (vivo, 651 raw files, 95-min replay window, 27,908 batches, completed in ~2.5 hours).

| Stage | Configuration | Theoretical ceiling | Real-world throughput |
|---|---|---|---|
| Sender | max_concurrency=20 × ~0.55 s per invocation (incl. ACK ~400 ms) | ~36 batch/s | ~3 batch/s (8% utilization — idle most of the time) |
| Parser | max_concurrency=10 × ~150 s wall time per 100K-line file × ~50 batches/file | ~3.3 batch/s | ~3 batch/s (93% utilization — fully loaded) |

Under default config the Parser is the bottleneck and Sender is mostly idle. Sustained throughput is around 3 batch/s ≈ 3,000 lines/s, far below the Sender's 100,000 nameplate. The Parser slowness is dominated by the 5 sequential R2 operations per batch (head .done + head .queued + put batchKey + put .queued + send queue), each averaging ~588 ms inside a long-running parser invocation. With ~50 batches per 100K-line file, that's ~150 seconds of I/O wall time per file.
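
The cost accounting behind those numbers, as a sketch (the ~588 ms figure is the measured average quoted above; everything else is arithmetic):

```ts
// Why one 100K-line file costs ~150 s of Parser wall time.
const R2_OPS_PER_BATCH = 5;   // head .done, head .queued, put batch, put .queued, queue send
const MS_PER_R2_OP = 588;     // measured average inside a long-running invocation
const BATCHES_PER_FILE = 50;  // ~100K lines at 1000 lines/batch

const msPerBatch = R2_OPS_PER_BATCH * MS_PER_R2_OP;        // ~2940 ms
const secPerFile = (msPerBatch * BATCHES_PER_FILE) / 1000; // ~147 s
console.log(secPerFile);
```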

If you need to finish faster, the only safe lever is raising Parser max_concurrency:

| Parser max_concurrency | Expected sustained throughput | Expected duration for the same 651-file run | Risk |
|---|---|---|---|
| 10 (default) | ~3 batch/s | ~2.5 hours | None — current default |
| 20 | ~6 batch/s | ~1.4 hours | Low — R2 op count doubles, still far from any platform limit |
| 30 | ~9 batch/s | ~55 min | Medium — sustained R2 binding pressure may degrade per-op latency at this density |
| 50+ | uncertain | uncertain | Not recommended — per-op latency may grow non-linearly |

Do not raise Sender max_concurrency above 20 or shorten MIN_SENDER_INVOCATION_MS below 200. Sender already has plenty of headroom — the bottleneck is upstream. Touching these would only risk pressuring the receiver and the production worker on the same endpoint.
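
For context, the pacing rule those two knobs implement, sketched (helper names illustrative):

```ts
// Pad every Sender invocation to >= MIN_SENDER_INVOCATION_MS so that
// 20 consumers x 5 invocations/s x 1000 lines can never exceed 100,000 lines/s.
const MIN_SENDER_INVOCATION_MS = 200;

async function pacedSend(sendBatch: () => Promise<void>): Promise<void> {
  const started = Date.now();
  await sendBatch(); // gzip + signed POST + .done marker + cleanup
  const elapsed = Date.now() - started;
  if (elapsed < MIN_SENDER_INVOCATION_MS) {
    await new Promise((resolve) => setTimeout(resolve, MIN_SENDER_INVOCATION_MS - elapsed));
  }
}
```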

For large raw files, the parse queue is intentionally configured with max_batch_size = 1, so each large gzip file gets its own invocation instead of sharing CPU time with other large files in the same batch.

Note: the per-batch parser I/O cost (~588 ms / R2 op × 5 ops) is a property of running many sequential R2 ops inside a single long Worker invocation, not of any individual operation. Sender single-batch invocations issue similar R2 calls in <100 ms total because each invocation is short-lived.

Monitoring

There are two independent monitoring channels: the HTTP status endpoint (aggregated progress) and Worker logs (per-batch send evidence).

1. /backfill/status — aggregated progress

```sh
curl https://ctyun-logpush-backfill.<your-subdomain>.workers.dev/backfill/status
```

Key fields:

| Field | Meaning |
|---|---|
| summary | One-line plain-language status |
| stage_code | scanning_and_queueing / sending / sent_waiting_cleanup / cleaned |
| delivery_completed | true when batches_pending == 0 and all batches sent |
| fully_completed | true only after temporary files are also cleaned |
| matched_raw_files | Total raw files matched in the time window |
| batches_sent / batches_pending | Sent vs pending batch counts |
| log_lines_sent | Total lines delivered to the receiver |
| batch_lines_avg / _min / _max | Distribution of batch sizes (lines per POST) |
| replay_window_beijing | Replay window in Beijing time |
| last_refresh_beijing | Last status update time |

Use GET /backfill/status?view=raw for the raw state document plus low-level artifact stats.
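
An operator-side polling loop might look like this (the fields used come from the table above; the loop itself is illustrative):

```ts
// Poll /backfill/status once a minute until the run is fully complete.
const STATUS_URL =
  "https://ctyun-logpush-backfill.<your-subdomain>.workers.dev/backfill/status";

async function waitForCompletion(): Promise<void> {
  for (;;) {
    const res = await fetch(STATUS_URL);
    const status = (await res.json()) as {
      stage_code: string;
      delivery_completed: boolean;
      fully_completed: boolean;
    };
    console.log(`${status.stage_code} (delivery_completed=${status.delivery_completed})`);
    if (status.fully_completed) return; // sent AND cleaned
    await new Promise((resolve) => setTimeout(resolve, 60_000));
  }
}
```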

2. Worker logs — per-batch send evidence

```sh
wrangler tail ctyun-logpush-backfill
```

Every successful Sender invocation emits one log entry like:

```
[BACKFILL][INFO] Sent batch lines=1000 bytes=1064735 (uncompressed)
  -> HTTP 200 | ack_ms=408 queue_wait_ms=525 | processed-backfill/.../batch.txt
```

| Field | Meaning |
|---|---|
| ack_ms | Time from fetch() start to receiver HTTP response (round-trip + receiver processing) |
| queue_wait_ms | Time the batch sat in send-queue (from Parser enqueue to Sender pick-up + send) |
| bytes | Uncompressed payload size; the gzip-compressed body is ~10× smaller |

Parser also emits Parsing: ... and Done: ... per file, which lets you measure single-file wall time (subtract the two timestamps).

Files that exist in backfill-state/

| File | When written |
|---|---|
| progress.json | Updated by every cron tick (~1/min); always present once a run starts |
| status.json | Written only when someone hits GET /backfill/status. If no one ever curls the endpoint, this file does not exist in R2. |

Files that do NOT exist in the current code: send-stats.json (aggregate ack_ms/queue_wait_ms statistics) was removed in the queue-only refactor. Per-batch ack_ms / queue_wait_ms are emitted to Worker logs only (see channel 2 above) — they are not aggregated into any R2 file. If you see a send-stats.json in an old bucket, it is leftover data from a previous code version.

Cleanup Lifecycle

After all batches are delivered:

  1. inspectRunArtifacts — list every batch artifact and verify .done exists
  2. cleanup.status = ready — entered when all artifacts confirmed sent
  3. 2-hour grace period — temporary files retained for inspection
  4. cleanup.status = deleting — bulk delete processed-backfill/<run-id>/*
  5. cleanup.status = done — state marked cleaned, run is fully complete

The grace period is intentional: it gives operators a short window to inspect what was sent before files are removed.
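
A sketch of the gate between steps 2 and 4 (status names follow the lifecycle above; field names are assumptions):

```ts
// Only move from "ready" to "deleting" once the 2-hour grace period has passed.
const GRACE_PERIOD_MS = 2 * 60 * 60 * 1000;

type CleanupStatus = "ready" | "deleting" | "done";

function mayStartDeleting(status: CleanupStatus, readyAtMs: number, nowMs = Date.now()): boolean {
  return status === "ready" && nowMs - readyAtMs >= GRACE_PERIOD_MS;
}
```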

Re-running the Same Time Window

If you need to replay the exact same [A, B] window after a successful run:

```sh
# 1. Delete state files
wrangler r2 object delete cdn-logs-raw/backfill-state/progress.json
wrangler r2 object delete cdn-logs-raw/backfill-state/status.json

# 2. Either redeploy, or wait for the next cron tick (BACKFILL_ENABLED stays "true")
```

Old processed-backfill/<run-id>/ prefixes do not block a fresh run — each run uses a new run_id (timestamp-suffixed).

Running a Different Time Window on the Same Zone

This is the normal repeat-run path. Once the previous run is cleaned (or fully_completed = true), just update BACKFILL_START_TIME / BACKFILL_END_TIME and deploy again.

The Worker only auto-reinitializes when the previous state is already cleaned. If a run is still in the running or done state, do not switch the time window yet.

Finished?

The safest completion rule: treat the run as finished only when /backfill/status reports fully_completed = true (equivalently, stage_code = cleaned). delivery_completed = true alone means all data has been sent, but temporary files may still be awaiting cleanup.