This is the lightweight backfill design. It replays historical R2 logs inside a precise time window, drops Worker subrequests, transforms each record to the CTYun partner format, and sends batches to the customer endpoint with explicit rate control.
Three customer-side requirements drive the entire design:
The customer endpoint cannot absorb arbitrary throughput. Sender is hard-capped at:
max_concurrency × (1000ms / MIN_SENDER_INVOCATION_MS) × BATCH_SIZE
= 20 × (1000 / 200) × 1000
= 100,000 lines/s
This is a theoretical ceiling, assuming each batch fills BATCH_SIZE = 1000 lines and customer ACK latency is below 200ms. Actual sustained throughput is also bounded by Parser upstream supply — see Throughput & Tuning below.
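A minimal sketch of how that floor could be enforced in the Sender (MIN_SENDER_INVOCATION_MS is the design's own knob; the paceInvocation helper and its padding logic are illustrative, not the production code):

```ts
// Sketch: pad each Sender invocation to MIN_SENDER_INVOCATION_MS so a
// fast customer ACK cannot push throughput above the hard cap of
// max_concurrency x (1000 / MIN_SENDER_INVOCATION_MS) x BATCH_SIZE lines/s.
const MIN_SENDER_INVOCATION_MS = 200;

async function paceInvocation<T>(work: () => Promise<T>): Promise<T> {
  const start = Date.now();
  const result = await work(); // e.g. the signed POST plus marker writes
  const elapsed = Date.now() - start;
  if (elapsed < MIN_SENDER_INVOCATION_MS) {
    await new Promise((r) => setTimeout(r, MIN_SENDER_INVOCATION_MS - elapsed));
  }
  return result;
}
```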
Duplicate deliveries to the customer must be minimized. Deduplication relies on two marker files (.done and .queued). These cover the common cases (Parser retry, Queue redelivery), but two residual scenarios remain:

- SEND_TIMEOUT_MS triggers, but the POST had already been accepted by the receiver. The Worker has no way to know; retry → receiver gets the batch a second time. Probability under normal conditions: < 0.001%.
- The .done write fails after a successful POST — R2 PUT permanent failure (3 internal retries already applied). Without .done the next retry resends. Probability: < 0.0001%.

The customer is billed/audited per top-level request only. The Parser filters out, per record:
ParentRayID !== "00" // i.e. it's a subrequest
WorkerSubrequest === true // explicit subrequest flag
Both checks run before the format transform.
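A minimal sketch of the filter, assuming records expose the two fields above (the LogRecord type and the isTopLevelRequest name are illustrative):

```ts
// Illustrative record shape: only the two fields used by the filter are typed.
interface LogRecord {
  ParentRayID: string;
  WorkerSubrequest?: boolean;
  [field: string]: unknown;
}

// Keep only top-level requests; runs before the format transform.
function isTopLevelRequest(record: LogRecord): boolean {
  if (record.ParentRayID !== "00") return false;      // subrequest by ray lineage
  if (record.WorkerSubrequest === true) return false; // explicit subrequest flag
  return true;
}
```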
The production worker ctyun-logpush-worker sends real-time logs to the same customer endpoint at a similar rate. Backfill is fully isolated at the Cloudflare layer:
| Resource | Production | Backfill | Shared? |
|---|---|---|---|
| Worker isolate / CPU budget | ctyun-logpush | ctyun-logpush-backfill | Independent |
| Queues | parse-queue / send-queue | parse-queue-backfill / send-queue-backfill | Independent |
| R2 bucket | cdn-logs-raw | cdn-logs-raw | Shared (different prefixes) |
| R2 prefix | processed/ | processed-backfill/<run-id>/ | Independent |
| Customer endpoint | shared | shared | Shared (downstream pressure) |
Implication: a stuck backfill (e.g. customer endpoint hangs) does not consume production Worker concurrency or queue slots. Both workers share the receiver's bandwidth, so backfill's explicit rate cap is what protects the receiver.
R2 logs/
-> parse-queue-backfill
-> Parser (gzip decode + filter + transform + chunk into 1000-line batches)
-> processed-backfill/<run-id>/{batch}.txt + {batch}.queued
-> send-queue-backfill
-> Sender (gzip + signed POST + .done marker + cleanup)
-> customer endpoint
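The .done marker in the Sender step is what makes retries mostly safe. A hedged sketch of that guard, assuming an R2 bucket binding and an illustrative postBatch helper (neither is the production code):

```ts
// Sketch: skip batches that already carry a .done marker so a Queue
// redelivery or Parser retry does not resend an acknowledged batch.
async function sendOnce(bucket: R2Bucket, batchKey: string): Promise<void> {
  const doneKey = batchKey.replace(/\.txt$/, ".done");
  if (await bucket.head(doneKey)) return; // already delivered, nothing to do

  const batch = await bucket.get(batchKey);
  if (!batch) return; // batch file already cleaned up

  await postBatch(await batch.text()); // gzip + signed POST (not shown)
  await bucket.put(doneKey, ""); // residual risk: this PUT can still fail after a successful POST
}
declare function postBatch(body: string): Promise<void>;
```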
Each .gz file is parsed independently and written into final batch files directly. The last batch of each file may be under-filled (fewer than 1000 lines) — this is acceptable.
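A minimal sketch of that chunking step (BATCH_SIZE matches the config table below; chunkLines is an illustrative name):

```ts
const BATCH_SIZE = 1000; // lines per POST, from the config table below

// Split one decoded, filtered .gz file into batches; the last batch
// may be under-filled, which is acceptable by design.
function chunkLines(lines: string[]): string[][] {
  const batches: string[][] = [];
  for (let i = 0; i < lines.length; i += BATCH_SIZE) {
    batches.push(lines.slice(i, i + BATCH_SIZE));
  }
  return batches;
}
```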
Only records inside [BACKFILL_START_TIME, BACKFILL_END_TIME] that pass the top-level filter (ParentRayID = "00" and WorkerSubrequest != true) are replayed.

| Variable | Default | Meaning |
|---|---|---|
| BACKFILL_START_TIME | "" | Replay start time, ISO 8601 |
| BACKFILL_END_TIME | "" | Replay end time, ISO 8601, must be <= now |
| BACKFILL_ENABLED | "false" | Must be "true" to run |
| BACKFILL_RATE | "60" | Files enqueued per cron minute (max 100) |
| BATCH_SIZE | "1000" | Lines per POST. Do not increase without confirming receiver capacity. |
| SEND_TIMEOUT_MS | "300000" | Max wait for customer ACK. Customer-tested ACK is 1–5s; the long timeout is intentional, to minimize abort-induced duplicates. |
| LOG_PREFIX | "logs/" | Logpush root prefix in R2 |
| LOG_LEVEL | "info" | debug / info / warn / error |
Maximum replay window: 7 days (MAX_RANGE_HOURS = 24 * 7). Larger windows must be split into multiple runs.
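A sketch of the validation these settings imply (MAX_RANGE_HOURS comes from the design; validateWindow and its error messages are illustrative):

```ts
const MAX_RANGE_HOURS = 24 * 7; // 7-day maximum replay window

function validateWindow(startIso: string, endIso: string): void {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  if (Number.isNaN(start) || Number.isNaN(end)) {
    throw new Error("BACKFILL_START_TIME / BACKFILL_END_TIME must be ISO 8601");
  }
  if (end > Date.now()) throw new Error("BACKFILL_END_TIME must be <= now");
  const hours = (end - start) / 3_600_000;
  if (hours <= 0) throw new Error("window is empty");
  if (hours > MAX_RANGE_HOURS) throw new Error("window exceeds 7 days; split into multiple runs");
}
```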
wrangler queues create parse-queue-backfill
wrangler queues create send-queue-backfill
wrangler queues create parse-dlq-backfill
wrangler queues create send-dlq-backfill
wrangler secret put CTYUN_ENDPOINT --config wrangler-backfill.toml
wrangler secret put CTYUN_PRIVATE_KEY --config wrangler-backfill.toml
wrangler secret put CTYUN_URI_EDGE --config wrangler-backfill.toml
wrangler deploy --config wrangler-backfill.toml
The pipeline has two independent ceilings — the lower one wins. The numbers below come from a real customer run (vivo, 651 raw files, 95-min replay window, 27,908 batches, completed in ~2.5 hours).
| Stage | Configuration | Theoretical ceiling | Real-world throughput |
|---|---|---|---|
| Sender | max_concurrency=20 × ~0.55s per invocation (incl. ACK ~400ms) | ~36 batch/s | ~3 batch/s (8% utilization — idle most of the time) |
| Parser | max_concurrency=10 × ~150s wall time per 100K-line file × ~50 batches/file | ~3.3 batch/s | ~3 batch/s (93% utilization — fully loaded) |
If you need to finish faster, the only safe lever is raising Parser max_concurrency:
| Parser max_concurrency | Expected sustained throughput | Expected duration for the same 651-file run | Risk |
|---|---|---|---|
| 10 (default) | ~3 batch/s | ~2.5 hours | None — current default |
| 20 | ~6 batch/s | ~1.4 hours | Low — R2 op count doubles, still far from any platform limit |
| 30 | ~9 batch/s | ~55 min | Medium — sustained R2 binding pressure may degrade per-op latency at this density |
| 50+ | uncertain | uncertain | Not recommended — per-op latency may grow non-linearly |
Do not raise Sender max_concurrency above 20 or shorten MIN_SENDER_INVOCATION_MS below 200. Sender already has plenty of headroom — the bottleneck is upstream. Touching these would only risk pressuring the receiver and the production worker on the same endpoint.
For large raw files, the parse queue is intentionally configured with max_batch_size = 1, so each large gzip file gets its own invocation instead of sharing CPU time with two other large files.
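A hedged sketch of what that consumer block could look like in wrangler-backfill.toml (queue names from the setup commands above; exact keys depend on your wrangler version):

```toml
[[queues.consumers]]
queue = "parse-queue-backfill"
max_batch_size = 1              # one large gzip file per invocation
max_concurrency = 10            # the Parser tuning lever discussed above
dead_letter_queue = "parse-dlq-backfill"
```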
Note: the per-batch parser I/O cost (~588 ms / R2 op × 5 ops) is a property of running many sequential R2 ops inside a single long Worker invocation, not of any individual operation. Sender single-batch invocations issue similar R2 calls in <100 ms total because each invocation is short-lived.
There are two independent monitoring channels: the HTTP status endpoint (aggregated progress) and Worker logs (per-batch send evidence).
/backfill/status — aggregated progress:

curl https://ctyun-logpush-backfill.<your-subdomain>.workers.dev/backfill/status
Key fields:
| Field | Meaning |
|---|---|
| summary | One-line plain-language status |
| stage_code | scanning_and_queueing / sending / sent_waiting_cleanup / cleaned |
| delivery_completed | true when batches_pending == 0 and all batches sent |
| fully_completed | true only after temporary files are also cleaned |
| matched_raw_files | Total raw files matched in the time window |
| batches_sent / batches_pending | Sent vs pending batch counts |
| log_lines_sent | Total lines delivered to the receiver |
| batch_lines_avg / _min / _max | Distribution of batch sizes (lines per POST) |
| replay_window_beijing | Replay window in Beijing time |
| last_refresh_beijing | Last status update time |
Use GET /backfill/status?view=raw for the raw state document plus low-level artifact stats.
wrangler tail ctyun-logpush-backfill
Every successful Sender invocation emits one line like:
[BACKFILL][INFO] Sent batch lines=1000 bytes=1064735 (uncompressed)
-> HTTP 200 | ack_ms=408 queue_wait_ms=525 | processed-backfill/.../batch.txt
| Field | Meaning |
|---|---|
| ack_ms | Time from fetch() start to receiver HTTP response (round-trip + receiver processing) |
| queue_wait_ms | Time the batch sat in send-queue (from Parser enqueue to Sender pick-up + send) |
| bytes | Uncompressed payload size; the gzip-compressed body is ~10× smaller |
Parser also emits Parsing: ... and Done: ... per file, which lets you measure single-file wall time (subtract the two timestamps).
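As a sketch of how ack_ms could be measured (wall time around fetch(); timedPost is an illustrative name, not the production code):

```ts
// Sketch: ack_ms is wall time from fetch() start to the receiver's HTTP
// response, i.e. network round-trip plus receiver-side processing.
async function timedPost(url: string, body: BodyInit): Promise<{ status: number; ack_ms: number }> {
  const start = Date.now();
  const resp = await fetch(url, { method: "POST", body });
  return { status: resp.status, ack_ms: Date.now() - start };
}
```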
All run state lives under backfill-state/:

| File | When written |
|---|---|
| progress.json | Updated by every cron tick (~1/min); always present once a run starts |
| status.json | Written only when someone hits GET /backfill/status. If no one ever curls the endpoint, this file does not exist in R2. |
Files that do NOT exist in the current code: send-stats.json (aggregate ack_ms/queue_wait_ms statistics) was removed in the queue-only refactor. Per-batch ack_ms / queue_wait_ms are emitted to Worker logs only (see channel 2 above) — they are not aggregated into any R2 file. If you see a send-stats.json in an old bucket, it is leftover data from a previous code version.
After all batches are delivered:
- inspectRunArtifacts — list every batch artifact and verify .done exists
- cleanup.status = ready — entered when all artifacts are confirmed sent
- cleanup.status = deleting — bulk delete processed-backfill/<run-id>/*
- cleanup.status = done — state marked cleaned, run is fully complete

The grace period is intentional: it gives operators a short window to inspect what was sent before files are removed.
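A sketch of that progression as a typed state machine (status values from the list above; the transition guard is illustrative):

```ts
type CleanupStatus = "ready" | "deleting" | "done";

// Sketch: advance cleanup one step per cron tick; each step is idempotent,
// so a tick that crashes mid-delete can safely re-run.
function nextCleanupStatus(status: CleanupStatus, prefixEmpty: boolean): CleanupStatus {
  switch (status) {
    case "ready":    return "deleting";                    // start the bulk delete
    case "deleting": return prefixEmpty ? "done" : status; // wait until the prefix is empty
    case "done":     return status;                        // terminal: run fully complete
  }
}
```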
If you need to replay the exact same [A, B] window after a successful run:
# 1. Delete state files
wrangler r2 object delete cdn-logs-raw/backfill-state/progress.json
wrangler r2 object delete cdn-logs-raw/backfill-state/status.json
# 2. Either redeploy, or wait for the next cron tick (BACKFILL_ENABLED stays "true")
Old processed-backfill/<run-id>/ prefixes do not block a fresh run — each run uses a new run_id (timestamp-suffixed).
This is the normal repeat-run path. Once the previous run is cleaned (or fully_completed = true), just update BACKFILL_START_TIME / BACKFILL_END_TIME and deploy again.
- No manual deletion of processed-backfill/<old-run-id>/ is required.
- No manual deletion of backfill-state/progress.json / status.json is required.

The Worker only auto-reinitializes when the previous state is already cleaned. If a run is still running / done, do not switch the time window yet.
The safest completion rule:
- /backfill/status shows fully_completed = true
- send-queue-backfill backlog is zero (visible in the Cloudflare Dashboard)
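A small polling sketch for the first half of that rule (endpoint and field name from the status section above; waitForCompletion is illustrative, and the queue backlog check stays manual):

```ts
// Sketch: poll the status endpoint until the run reports full completion.
// The send-queue-backfill backlog check remains a manual Dashboard step.
async function waitForCompletion(statusUrl: string, intervalMs = 60_000): Promise<void> {
  for (;;) {
    const res = await fetch(statusUrl);
    const status = (await res.json()) as { fully_completed: boolean };
    if (status.fully_completed) return;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```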