CTYun Logpush Backfill Guide

This guide describes the lightweight backfill design. It replays historical R2 logs inside a precise time window, drops Worker subrequests, transforms each record to the CTYun partner format, and sends batches to the customer endpoint under explicit rate control.

Customer Constraints

Three customer-side requirements drive the entire design:

1. Receiver throughput cap (~100,000 lines/s)

The customer endpoint cannot absorb arbitrary throughput, so the Sender is hard-capped at:

```
max_concurrency × (1000 ms / MIN_SENDER_INVOCATION_MS) × BATCH_SIZE
= 20 × (1000 / 200) × 1000
= 100,000 lines/s
```

This is a theoretical ceiling, assuming each batch fills BATCH_SIZE = 1000 lines and customer ACK latency is below 200ms. Actual sustained throughput is also bounded by Parser upstream supply — see Throughput & Tuning below.
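
As a quick sanity check, the same arithmetic in code. The constant names mirror this guide's variables; treat the snippet as illustrative, not the Worker's actual source:

```ts
// Illustrative recomputation of the Sender's theoretical ceiling.
// Constant names follow this guide; they are assumptions, not verified source.
const MAX_CONCURRENCY = 20;           // Sender queue consumers running in parallel
const MIN_SENDER_INVOCATION_MS = 200; // enforced floor per Sender invocation
const BATCH_SIZE = 1000;              // lines per POST

const invocationsPerSec = 1000 / MIN_SENDER_INVOCATION_MS;           // 5
const ceilingLinesPerSec = MAX_CONCURRENCY * invocationsPerSec * BATCH_SIZE;
console.log(ceilingLinesPerSec);      // 100000
```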

2. Receiver does NOT dedupe

Delivery is at-least-once, not exactly-once. The receiver cannot deduplicate, so the Worker minimizes duplicates with two on-R2 markers (.done and .queued). These cover the common cases (Parser retry, Queue redelivery), but they do not eliminate duplicates entirely: a residual duplicate rate remains in rare failure windows. Each batch carries a stable batch identity in its R2 key. Receivers that can dedupe should key on it; receivers that cannot must accept the residual duplicate rate as a known risk.
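
A minimal sketch of the marker check, assuming an R2 binding and the .done/.queued key convention described above (the helper name and exact key layout are illustrative):

```ts
// Skip a batch when either on-R2 marker already exists.
// Assumes `bucket` is the Worker's R2 binding; key layout is illustrative.
async function shouldEmitBatch(bucket: R2Bucket, batchKey: string): Promise<boolean> {
  // .done: the Sender already delivered this batch (covers Queue redelivery).
  if (await bucket.head(`${batchKey}.done`)) return false;
  // .queued: the Parser already enqueued it (covers Parser retry).
  if (await bucket.head(`${batchKey}.queued`)) return false;
  return true;
}
```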

3. Drop Worker subrequests

The customer is billed/audited per top-level request only. The Parser drops, per record, anything matching either check:

```
ParentRayID !== "00"        // i.e. the record is a subrequest
WorkerSubrequest === true   // explicit subrequest flag
```

Both checks run before the format transform.
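
In TypeScript terms, the predicate looks roughly like this (ParentRayID and WorkerSubrequest are standard Logpush HTTP-request fields; the record type here is a simplification):

```ts
// Keep only top-level requests; drop anything a Worker initiated.
interface LogRecord {
  ParentRayID: string;        // "00" marks a top-level request
  WorkerSubrequest?: boolean; // true on Worker-initiated subrequests
}

function isTopLevelRequest(r: LogRecord): boolean {
  return r.ParentRayID === "00" && r.WorkerSubrequest !== true;
}
```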

Coexistence with Production Worker

The production worker ctyun-logpush sends real-time logs to the same customer endpoint at a similar rate. Backfill is fully isolated at the Cloudflare layer:

| Resource | Production | Backfill | Shared? |
|---|---|---|---|
| Worker isolate / CPU budget | ctyun-logpush | ctyun-logpush-backfill | Independent |
| Queues | parse-queue / send-queue | parse-queue-backfill / send-queue-backfill | Independent |
| R2 bucket | cdn-logs-raw | cdn-logs-raw | Shared (different prefixes) |
| R2 prefix | processed/ | processed-backfill/<run-id>/ | Independent |
| Customer endpoint | shared | shared | Shared (downstream pressure) |

Implication: a stuck backfill (e.g. customer endpoint hangs) does not consume production Worker concurrency or queue slots. Both workers share the receiver's bandwidth, so backfill's explicit rate cap is what protects the receiver.

Architecture

```
R2 logs/
-> parse-queue-backfill
-> Parser  (gzip decode + filter + transform + chunk into 1000-line batches)
-> processed-backfill/<run-id>/{batch}.txt + {batch}.queued
-> send-queue-backfill
-> Sender  (gzip + signed POST + .done marker + cleanup)
-> customer endpoint
```

No Durable Object is used in the hot path. Each raw .gz file is parsed independently and written into final batch files directly. The last batch of each file may be under-filled (fewer than 1000 lines) — this is acceptable.
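
The chunking step itself is simple; a sketch (names illustrative):

```ts
// Split one decoded file's transformed lines into fixed-size batches.
// The final batch may hold fewer than batchSize lines, as noted above.
function chunkIntoBatches(lines: string[], batchSize = 1000): string[][] {
  const batches: string[][] = [];
  for (let i = 0; i < lines.length; i += batchSize) {
    batches.push(lines.slice(i, i + batchSize));
  }
  return batches;
}
```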

What It Guarantees

- Delivery is at-least-once: duplicates are possible (constraint 2), silent loss is not; batches that exhaust retries land in the dead-letter queues rather than disappearing.
- Bounded send rate: the Sender can never exceed the ~100,000 lines/s theoretical ceiling (constraint 1).
- Top-level requests only: subrequest records are filtered out before the format transform (constraint 3).
- Production isolation: backfill shares only the R2 bucket (under a separate prefix) and the customer endpoint with the production worker.

Deploy

| Variable | Default | Meaning |
|---|---|---|
| BACKFILL_START_TIME | "" | Replay start time, ISO 8601 |
| BACKFILL_END_TIME | "" | Replay end time, ISO 8601, must be <= now |
| BACKFILL_ENABLED | "false" | Must be "true" to run |
| BACKFILL_RATE | "60" | Files enqueued per cron minute (max 100) |
| BATCH_SIZE | "1000" | Lines per POST. Do not increase without confirming receiver capacity. |
| SEND_TIMEOUT_MS | "300000" | Max wait for customer ACK. Customer-tested ACK is 1–5 s; the long timeout is intentional, to minimize abort-induced duplicates. |
| LOG_PREFIX | "logs/" | Logpush root prefix in R2 |
| LOG_LEVEL | "info" | debug / info / warn / error |

Maximum replay window: 7 days (MAX_RANGE_HOURS = 24 * 7). Larger windows must be split into multiple runs.
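
A sketch of the validation this implies, assuming ISO 8601 inputs (the Worker's actual error handling may differ):

```ts
// Reject windows that are empty, in the future, or longer than 7 days.
const MAX_RANGE_HOURS = 24 * 7;

function validateWindow(startIso: string, endIso: string): void {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  if (Number.isNaN(start) || Number.isNaN(end)) {
    throw new Error("BACKFILL_START_TIME / BACKFILL_END_TIME must be ISO 8601");
  }
  if (end > Date.now()) throw new Error("BACKFILL_END_TIME must be <= now");
  if (end <= start) throw new Error("window is empty");
  if ((end - start) / 3_600_000 > MAX_RANGE_HOURS) {
    throw new Error("window exceeds 7 days: split into multiple runs");
  }
}
```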

```sh
wrangler queues create parse-queue-backfill
wrangler queues create send-queue-backfill
wrangler queues create parse-dlq-backfill
wrangler queues create send-dlq-backfill

wrangler secret put CTYUN_ENDPOINT --config wrangler-backfill.toml
wrangler secret put CTYUN_PRIVATE_KEY --config wrangler-backfill.toml
wrangler secret put CTYUN_URI_EDGE --config wrangler-backfill.toml

wrangler deploy --config wrangler-backfill.toml
```

Throughput & Tuning

The pipeline has two independent ceilings — the lower one wins. The numbers below come from a real customer run (vivo, 651 raw files, 95-min replay window, 27,908 batches, completed in ~2.5 hours).

| Stage | Configuration | Theoretical ceiling | Real-world throughput |
|---|---|---|---|
| Sender | max_concurrency=20 × ~0.55 s per invocation (incl. ACK ~400 ms) | ~36 batch/s | ~3 batch/s (8% utilization — idle most of the time) |
| Parser | max_concurrency=10 × ~150 s wall time per 100K-line file × ~50 batches/file | ~3.3 batch/s | ~3 batch/s (93% utilization — fully loaded) |

Under default config the Parser is the bottleneck and Sender is mostly idle. Sustained throughput is around 3 batch/s ≈ 3,000 lines/s, far below the Sender's 100,000 nameplate. The Parser slowness is dominated by the 5 sequential R2 operations per batch (head .done + head .queued + put batchKey + put .queued + send queue), each averaging ~588 ms inside a long-running parser invocation. With ~50 batches per 100K-line file, that's ~150 seconds of I/O wall time per file.
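
The cost accounting behind those numbers, as a sketch (the ~588 ms figure is the measured average quoted above; everything else is arithmetic):

```ts
// Why one 100K-line file costs ~150 s of Parser wall time.
const R2_OPS_PER_BATCH = 5;   // head .done, head .queued, put batch, put .queued, queue send
const MS_PER_R2_OP = 588;     // measured average inside a long-running invocation
const BATCHES_PER_FILE = 50;  // ~100K lines at 1000 lines/batch

const msPerBatch = R2_OPS_PER_BATCH * MS_PER_R2_OP;        // ~2940 ms
const secPerFile = (msPerBatch * BATCHES_PER_FILE) / 1000; // ~147 s
console.log(secPerFile);
```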

If you need to finish faster, the only safe lever is raising Parser max_concurrency:

| Parser max_concurrency | Expected sustained throughput | Expected duration for the same 651-file run | Risk |
|---|---|---|---|
| 10 (default) | ~3 batch/s | ~2.5 hours | None — current default |
| 20 | ~6 batch/s | ~1.4 hours | Low — R2 op count doubles, still far from any platform limit |
| 30 | ~9 batch/s | ~55 min | Medium — sustained R2 binding pressure may degrade per-op latency at this density |
| 50+ | uncertain | uncertain | Not recommended — per-op latency may grow non-linearly |

Do not raise Sender max_concurrency above 20 or shorten MIN_SENDER_INVOCATION_MS below 200. Sender already has plenty of headroom — the bottleneck is upstream. Touching these would only risk pressuring the receiver and the production worker on the same endpoint.
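
For context, the pacing rule those two knobs implement, sketched (helper names illustrative):

```ts
// Pad every Sender invocation to >= MIN_SENDER_INVOCATION_MS so that
// 20 consumers x 5 invocations/s x 1000 lines can never exceed 100,000 lines/s.
const MIN_SENDER_INVOCATION_MS = 200;

async function pacedSend(sendBatch: () => Promise<void>): Promise<void> {
  const started = Date.now();
  await sendBatch(); // gzip + signed POST + .done marker + cleanup
  const elapsed = Date.now() - started;
  if (elapsed < MIN_SENDER_INVOCATION_MS) {
    await new Promise((resolve) => setTimeout(resolve, MIN_SENDER_INVOCATION_MS - elapsed));
  }
}
```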

For large raw files, the parse queue is intentionally configured with max_batch_size = 1, so each large gzip file gets its own invocation instead of sharing CPU time with other large files in the same batch.

Note: the per-batch parser I/O cost (~588 ms / R2 op × 5 ops) is a property of running many sequential R2 ops inside a single long Worker invocation, not of any individual operation. Sender single-batch invocations issue similar R2 calls in <100 ms total because each invocation is short-lived.

Monitoring

There are two independent monitoring channels: the HTTP status endpoint (aggregated progress) and Worker logs (per-batch send evidence).

1. /backfill/status — aggregated progress

```sh
curl https://ctyun-logpush-backfill.<your-subdomain>.workers.dev/backfill/status
```

Key fields:

| Field | Meaning |
|---|---|
| summary | One-line plain-language status |
| stage_code | scanning_and_queueing / sending / sent_waiting_cleanup / cleaned |
| delivery_completed | true when batches_pending == 0 and all batches sent |
| fully_completed | true only after temporary files are also cleaned |
| matched_raw_files | Total raw files matched in the time window |
| batches_sent / batches_pending | Sent vs pending batch counts |
| log_lines_sent | Total lines delivered to the receiver |
| batch_lines_avg / _min / _max | Distribution of batch sizes (lines per POST) |
| replay_window_beijing | Replay window in Beijing time |
| last_refresh_beijing | Last status update time |

Use GET /backfill/status?view=raw for the raw state document plus low-level artifact stats.
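
An operator-side polling loop might look like this (the fields used come from the table above; the loop itself is illustrative):

```ts
// Poll /backfill/status once a minute until the run is fully complete.
const STATUS_URL =
  "https://ctyun-logpush-backfill.<your-subdomain>.workers.dev/backfill/status";

async function waitForCompletion(): Promise<void> {
  for (;;) {
    const res = await fetch(STATUS_URL);
    const status = (await res.json()) as {
      stage_code: string;
      delivery_completed: boolean;
      fully_completed: boolean;
    };
    console.log(`${status.stage_code} (delivery_completed=${status.delivery_completed})`);
    if (status.fully_completed) return; // sent AND cleaned
    await new Promise((resolve) => setTimeout(resolve, 60_000));
  }
}
```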

2. Worker logs — per-batch send evidence

```sh
wrangler tail ctyun-logpush-backfill
```

Every successful Sender invocation emits one log entry like:

```
[BACKFILL][INFO] Sent batch lines=1000 bytes=1064735 (uncompressed)
  -> HTTP 200 | ack_ms=408 queue_wait_ms=525 | processed-backfill/.../batch.txt
```

| Field | Meaning |
|---|---|
| ack_ms | Time from fetch() start to receiver HTTP response (round-trip + receiver processing) |
| queue_wait_ms | Time the batch sat in send-queue (from Parser enqueue to Sender pick-up + send) |
| bytes | Uncompressed payload size; the gzip-compressed body is ~10× smaller |

Parser also emits Parsing: ... and Done: ... per file, which lets you measure single-file wall time (subtract the two timestamps).

Files that exist in backfill-state/

| File | When written |
|---|---|
| progress.json | Updated by every cron tick (~1/min); always present once a run starts |
| status.json | Written only when someone hits GET /backfill/status. If no one ever curls the endpoint, this file does not exist in R2. |

Files that do NOT exist in the current code: send-stats.json (aggregate ack_ms/queue_wait_ms statistics) was removed in the queue-only refactor. Per-batch ack_ms / queue_wait_ms are emitted to Worker logs only (see channel 2 above) — they are not aggregated into any R2 file. If you see a send-stats.json in an old bucket, it is leftover data from a previous code version.

Cleanup Lifecycle

After all batches are delivered:

  1. inspectRunArtifacts — list every batch artifact and verify .done exists
  2. cleanup.status = ready — entered when all artifacts confirmed sent
  3. 2-hour grace period — temporary files retained for inspection
  4. cleanup.status = deleting — bulk delete processed-backfill/<run-id>/*
  5. cleanup.status = done — state marked cleaned, run is fully complete

The grace period is intentional: it gives operators a short window to inspect what was sent before files are removed.
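
A sketch of the gate between steps 2 and 4 (status names follow the lifecycle above; field names are assumptions):

```ts
// Only move from "ready" to "deleting" once the 2-hour grace period has passed.
const GRACE_PERIOD_MS = 2 * 60 * 60 * 1000;

type CleanupStatus = "ready" | "deleting" | "done";

function mayStartDeleting(status: CleanupStatus, readyAtMs: number, nowMs = Date.now()): boolean {
  return status === "ready" && nowMs - readyAtMs >= GRACE_PERIOD_MS;
}
```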

Re-running the Same Time Window

If you need to replay the exact same [A, B] window after a successful run:

```sh
# 1. Delete state files
wrangler r2 object delete cdn-logs-raw/backfill-state/progress.json
wrangler r2 object delete cdn-logs-raw/backfill-state/status.json

# 2. Either redeploy, or wait for the next cron tick (BACKFILL_ENABLED stays "true")
```

Old processed-backfill/<run-id>/ prefixes do not block a fresh run — each run uses a new run_id (timestamp-suffixed).

Running a Different Time Window on the Same Zone

This is the normal repeat-run path. Once the previous run is cleaned (or fully_completed = true), just update BACKFILL_START_TIME / BACKFILL_END_TIME and deploy again.

The Worker only auto-reinitializes when the previous state is already cleaned. If a run is still in the running or done state, do not switch the time window yet.

Finished?

The safest completion rule: treat the run as finished only when /backfill/status reports fully_completed = true (equivalently, stage_code = cleaned). delivery_completed = true alone means all data has been sent, but temporary files may still be awaiting cleanup.