Issues·Root Cause·Fixes¶
Environment: RuoYi-Cloud (ruoyi-modules-robot / ruoyi-api-robot), Gateway unified entry
/external/gs/**
;
Nacos registry & config, Sentinel dynamic rules, SkyWalking full-link, optional RabbitMQ async; Docker Compose one-click delivery.
0. Go-live Definitions & Load-test Baseline¶
- Ambiguous definitions & baseline
Symptom: Different teammates use inconsistent definitions for “success/failure” and whether 429 counts, causing metric misinterpretation.
Fix: Standardize definitions & baseline: - Dry-Run / consumers disabled load-test baseline: 100 QPS × 3 minutes; count 200/429 as success (to validate gateway rate-limit/stability).
- Observed metrics: P95/P99, Throughput, Error%, queue backlog, SkyWalking Success Rate.
-
Sample result: 18,035 samples P95 = 6 ms, P99 ≈ 10 ms, Error = 0, Throughput ≈ 100.2/s (consistent across rounds).
-
Rollback & cold-start not quantifiable
Fix: Persist to a “Change Log”: - Cold start: Docker Compose + initial SQL → ≤ 12 minutes to bring up a new env;
- Rollback: after microservices split ≤ 5 minutes (revert image/config).
1. Gateway / Nacos / Sentinel¶
-
Nacos stuck at STARTING
Cause: Persistent MySQL version/charset mismatch or init SQL not imported.
Fix: Completespring.datasource.*
, import official/project init scripts; align versions & images (recommend 2.2.x). -
Gateway 404 / Swagger “Try it out” fails
Cause:StripPrefix/Path
inconsistent with downstreamcontext-path
; route not updated after prefix change.
Fix: Unify path strategy; if needed removecontext-path
or adjust Strip rules accordingly; deprecate old prefixes after gray rollout. -
Rate-limit paths inconsistent, rules not matched
Cause:@SentinelResource
scattered in Controllers, inconsistent resource names.
Fix: Consolidate intoGsOpenApiServiceImpl
; standardize resource naming and map one-to-one with Nacos rules
(e.g.,ruoyi-gateway-gw-api-defs/gw-flow/gw-degrade
). -
Sentinel rules not applied / empty group
Cause: Missingsentinel-datasource-nacos
dependency, or DataId/Group vs JSON mismatch.
Fix: Add dependency; verify the triple{AppName, DataId, Group}
and JSON structure; before load test raise gateway thresholds to avoid premature 429 blocking. -
CORS preflight blocked
Cause: OPTIONS not allowed or response headers missing.
Fix: Configure global CORS: allow requiredOrigin/Headers/Methods
, expose necessary headers; explicitly allow OPTIONS.
2. Observability & Infrastructure¶
- No Gateway spans / logs missing
traceId
Cause: SCG/WebFlux plugin not loaded;-javaagent
not injected; Logback lacks MDC.
Fix: - Put
apm-spring-cloud-gateway-3.x-plugin
into plugins; - Start with
-javaagent:/path/to/agent/agent.jar
; -
Add
%X{traceId}
to Logback pattern; raise sampling rate during load tests, reduce noise in production per QPS. -
Docker: MySQL port conflict / no data on first run
Cause: Host occupying3306
or init SQL not imported.
Fix: Adjust port mapping or stop host MySQL; pre-import SQL in compose; addhealthcheck
and usedepends_on.condition=service_healthy
upstream. -
In-container access fails (misusing host
localhost
)
Cause: Containers treat host address as local.
Fix: Container-to-container calls use serviceName:port; to access host on Win/Mac, usehost.docker.internal
. -
Dependency/plugin conflicts (e.g., Reactor Netty)
Cause: Spring Boot/Cloud matrix misaligned with dependencies.
Fix: Lock a version matrix; include only needed gateway/webflux/skywalking plugins and verify compatibility.
3. Asynchrony & Idempotency¶
- Queue keeps growing during load test / duplicate consumption
Cause: Demo env consumers disabled (Dry-Run) or missing prefetch/manual ack; retries without upper bound.
Fix: - Load-test definition: with consumers off, only validate ingress & observability; backlog is acceptable;
- Production: enable consumers, set
prefetch = n
, manual ack, failures go to DLQ; -
For write operations add idempotency keys; timeouts/retries must not amplify failures.
-
RestTemplate pile-up / retries amplify issues
Cause: No timeouts/connection pool; automatic retries enabled.
Fix: Setconnect/read
timeouts, pool size; disable automatic retries; enforce idempotency on critical writes.
4. Code & Frontend (RuoYi + Vue)¶
-
MyBatis
Invalid bound statement
Cause:mapper-locations
or XMLnamespace
≠ interface FQN; XML not packaged.
Fix: Align@MapperScan + namespace
; ensure XML are scanned and packaged. -
Frontend CORS / port mismatch
Cause: Proxy not updated when switching environments.
Fix: Unify proxy invue.config.js
; ports/targets switch with env variables. -
Minor UI issues (from phase summary)
Symptom: Area/map names wrap, status doesn’t auto-refresh, response structures inconsistent, etc.
Fix: - Adjust table column width +
tooltip
; - Add “manual refresh” (add timed refresh later);
- Robust parsing for API responses (“string/object”) to avoid misleading task dispatch hints;
- Missing
areaId
↔ coordinates mapping → align with vendor data table, include in later milestones.
5. Conventions & Governance (Unified Naming & Traceability)¶
- Inconsistent resource names / log fields → hard to troubleshoot
Fix: - Standardize resource naming (aligned with Sentinel rules) and unify exception bodies;
- Inject
traceId
,requestId
,taskId
,resource
into logs; - Postmortem process: deliver first, evolve later; leave traces for changes & rollbacks; codify “Issue → Cause → Fix” into docs.
Appendix: Reusable Checklist (Pre-Go-Live Self-Check)¶
- [ ]
gateway
routes/prefixes consistent with downstream paths; - [ ] Nacos rules (DataId/Group) aligned with application resource names;
- [ ] SkyWalking
-javaagent
injected, Gateway/WebFlux plugins in place; - [ ] Docker
healthcheck
complete; initial SQL auto-imported at first boot; - [ ] Load-test definitions when consumers are off; for production enable consumers with DLQ;
- [ ] CORS, timeouts, connection pool, idempotency & retry policies in place;
- [ ] Logs contain trace fields; support trace ↔ log cross-checks.
Metrics source: phase summary & demo load-test records (baseline 100 QPS, P95 ≈ 6 ms, P99 ≈ 9–10 ms, Error = 0).