cuFOLIO · cuOpt PDLP
GPU portfolio optimization. Mean-CVaR linear program solved by NVIDIA cuOpt PDLP on top of KDE-generated scenarios. ~520 ms median solve on GB10 for an 8-symbol universe with 5,000 scenarios. The headline GPU number on the platform.
What it is: GPU portfolio optimization. NVIDIA's Mean-CVaR blueprint, solved by cuOpt's PDLP solver on GB10 in ~520 ms.
How it works: 5,000 Monte Carlo scenarios are generated via KDE on GPU (~269 ms). The LP minimizes tail loss at α=0.95 subject to long-only, max-position, leverage, cash-floor, and transaction-cost constraints. cuOpt PDLP solves it on GPU (~251 ms). Data never leaves device memory.
Why it matters: 30-100× faster than CPU SciPy. Unlocks walk-forward backtests in seconds, AutoResearch sweeps over hundreds of parameter combinations in one session, and rebalance proposals fast enough to be interactive. The headline GPU win.
Overview
cuFOLIO is NVIDIA's open-source portfolio optimization blueprint. NVTrader uses two pieces:
- KDE scenario generator — 5,000+ Monte Carlo return paths on GPU.
- cuOpt PDLP solver — primal-dual hybrid gradient LP solver, GPU-native.
Together they minimize the Conditional Value-at-Risk (CVaR) of a long-only equity portfolio subject to position caps, leverage limits, transaction costs, and a cash floor.
The optimization problem
minimize VaR_α + 1/(1−α) · E[ max(0, −portfolio_return − VaR_α) ] (CVaR at level α; VaR_α is the loss threshold, optimized jointly with w)
+ λ · turnover_cost
+ μ · vol_penalty
subject to Σᵢ wᵢ = 1 (fully invested)
0 ≤ wᵢ ≤ max_position_pct (long-only, cap)
w_cash ≥ cash_floor_pct (cash buffer)
Σᵢ |wᵢ − w0ᵢ| ≤ turnover_budget (smoothness)
Where:
- α is the CVaR confidence (default 0.95).
- w0 is the current portfolio.
- R is the scenarios matrix, shape (n_assets, n_scenarios), from KDE.
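To make the CVaR term concrete, here is a minimal pure-Python sketch (not the engine's code) that evaluates CVaR for a fixed weight vector in two ways: as the average of the worst (1 − α) fraction of scenario losses, and through the Rockafellar-Uryasev form, min over t of t + E[max(0, loss − t)] / (1 − α), which is the expression an LP solver can linearize. The function names, the toy matrix `R`, and the loss convention (loss = −portfolio_return) are assumptions made for illustration.

```python
from typing import List, Sequence


def portfolio_losses(w: Sequence[float], R: Sequence[Sequence[float]]) -> List[float]:
    """Per-scenario loss = -(w . R[:, s]); R has shape (n_assets, n_scenarios)."""
    n_scenarios = len(R[0])
    return [-sum(w[i] * R[i][s] for i in range(len(w))) for s in range(n_scenarios)]


def cvar_tail_average(losses: Sequence[float], alpha: float) -> float:
    """Mean of the worst (1 - alpha) fraction of scenario losses."""
    k = max(1, round((1.0 - alpha) * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


def cvar_rockafellar_uryasev(losses: Sequence[float], alpha: float) -> float:
    """min over t of  t + mean(max(0, loss - t)) / (1 - alpha).

    The objective is piecewise linear and convex in t with kinks at the
    sample losses, so trying each sample loss as a candidate t is enough.
    """
    n = len(losses)

    def objective(t: float) -> float:
        excess = sum(max(0.0, loss - t) for loss in losses) / n
        return t + excess / (1.0 - alpha)

    return min(objective(t) for t in losses)


# Toy data (assumed): 2 assets, 10 scenarios, equal weights.
R = [
    [0.01, -0.02, 0.03, -0.05, 0.02, 0.00, -0.01, 0.04, -0.03, 0.01],
    [0.02, -0.01, 0.01, -0.03, 0.00, 0.01, -0.02, 0.02, -0.04, 0.03],
]
w = [0.5, 0.5]
losses = portfolio_losses(w, R)
print(cvar_tail_average(losses, alpha=0.8))
print(cvar_rockafellar_uryasev(losses, alpha=0.8))
```

Both calls agree (up to floating-point rounding) whenever (1 − α) · n_scenarios is a whole number, as it is here with α = 0.8 and 10 scenarios. In the full problem the weights are decision variables too, which is what turns this evaluation into an LP.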
Knobs
| knob | type | default | effect |
|---|---|---|---|
| α | float [0.90, 0.99] | 0.95 | Higher = more tail-averse. 0.99 weights the worst 1% of scenarios. |
| n_scenarios | int [500, 50000] | 5000 | Cost is linear in n_scenarios. 5k is a sweet spot on GB10. |
| max_position_pct | float [0, 1] | 0.25 | Single-name cap. Lower → more diversification. |
| cash_floor_pct | float [0, 0.5] | 0.02 | Minimum cash. Useful for rebalance-cost robustness. |
| risk_aversion (λ) | float | 2.0 | Trade-off between expected return and CVaR. |
| turnover_cost_bps | float | 5.0 | Round-trip transaction cost assumption. |
| vol_penalty | float | 0.0 | Penalize realized vol > target. |
| solver_tolerance | float | 1e-6 | PDLP convergence tolerance. |
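The ranges and defaults in the table can be checked before a solve is submitted. The sketch below is a hypothetical helper, not part of cuFOLIO or the engine wrapper: the class name `CvarKnobs` and the function `_check_range` are invented for illustration, and only the knobs that list a range in the table are validated.

```python
from dataclasses import dataclass


def _check_range(name: str, value: float, lo: float, hi: float) -> None:
    """Raise ValueError when value falls outside the closed interval [lo, hi]."""
    if not lo <= value <= hi:
        raise ValueError(f"{name}={value} is outside [{lo}, {hi}]")


@dataclass(frozen=True)
class CvarKnobs:
    """Hypothetical container mirroring the knob table (defaults copied from it)."""

    alpha: float = 0.95
    n_scenarios: int = 5000
    max_position_pct: float = 0.25
    cash_floor_pct: float = 0.02
    risk_aversion: float = 2.0
    turnover_cost_bps: float = 5.0
    vol_penalty: float = 0.0
    solver_tolerance: float = 1e-6

    def __post_init__(self) -> None:
        # Ranges taken from the "type" column of the table.
        _check_range("alpha", self.alpha, 0.90, 0.99)
        _check_range("n_scenarios", self.n_scenarios, 500, 50000)
        _check_range("max_position_pct", self.max_position_pct, 0.0, 1.0)
        _check_range("cash_floor_pct", self.cash_floor_pct, 0.0, 0.5)


defaults = CvarKnobs()
print(defaults)
```

Constructing `CvarKnobs(alpha=0.5)` raises `ValueError`, which surfaces a bad sweep parameter before any GPU time is spent.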
Calling it
REST
curl -X POST http://127.0.0.1:8015/api/backtest/solve_once \
-H 'content-type: application/json' \
-d '{
"universe": ["SPY","QQQ","NVDA","MSFT","AAPL","META","AMZN","GOOGL"],
"alpha": 0.95,
"n_scenarios": 5000,
"max_position_pct": 0.25
}'
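The same request can be issued from Python with only the standard library. This is a sketch under the assumption that the endpoint accepts exactly the JSON body shown in the curl call and returns a JSON document; the helper names `build_solve_request` and `solve_once` are made up for this example.

```python
import json
import urllib.request

# URL copied from the curl example above.
SOLVE_URL = "http://127.0.0.1:8015/api/backtest/solve_once"


def build_solve_request(universe, alpha=0.95, n_scenarios=5000, max_position_pct=0.25):
    """Build (but do not send) the POST request with a JSON body."""
    payload = {
        "universe": list(universe),
        "alpha": alpha,
        "n_scenarios": n_scenarios,
        "max_position_pct": max_position_pct,
    }
    return urllib.request.Request(
        SOLVE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def solve_once(universe, timeout_s=30.0, **knobs):
    """Send the request and decode the JSON reply (needs the local service running)."""
    req = build_solve_request(universe, **knobs)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the service up, `solve_once(["SPY", "QQQ", "NVDA"], alpha=0.95)` would return the parsed response; building the request separately keeps the payload easy to inspect or log.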
Python
from traderspace.optimizer.cufolio_engine import CuFolioEngine
engine = CuFolioEngine()
result = engine.optimize_cvar(
universe=["SPY","QQQ","NVDA","MSFT","AAPL","META","AMZN","GOOGL"],
alpha=0.95,
n_scenarios=5000,
max_position_pct=0.25,
)
print(result["weights"])
print(result["telemetry"]) # scen_ms, solve_ms, total_ms
From the A2A bus
Publish a CritiqueClean or ScenarioRefreshed event; PortfolioOptimizationAgent picks it up and runs a solve. See A2A event bus docs.
Reading the result
{
"weights": [0.21, 0.18, 0.14, 0.12, 0.11, 0.09, 0.08, 0.07],
"universe": ["SPY","QQQ","NVDA",...],
"expected_return": 0.187, // annualized
"expected_cvar": -0.043, // tail loss at α
"telemetry": {
"scen_ms": 269,
"solve_ms": 251,
"total_ms": 521,
"scenarios": 5000,
"alpha": 0.95,
"device": "cuda:0",
"solver_iterations": 142
}
}
If device is cpu, you're on CPU fallback (30-100× slower). Check CUDA driver, torch.cuda.is_available(), and the cuFOLIO install.
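A small post-processing step makes the result easier to act on. The sketch below assumes only the field names shown in the sample response; `summarize_result` is a hypothetical helper, and the sample dict fills in the full eight-symbol universe from the request example.

```python
def summarize_result(result: dict) -> dict:
    """Pair weights with symbols and flag whether the solve ran on GPU."""
    weights = dict(zip(result["universe"], result["weights"]))
    telemetry = result["telemetry"]
    return {
        "weights": weights,
        "weight_sum": sum(result["weights"]),
        "largest_position": max(weights, key=weights.get),
        # Any non-CUDA device string is treated as the CPU fallback.
        "on_gpu": telemetry["device"].startswith("cuda"),
        # Share of wall time spent generating scenarios rather than solving.
        "scen_share": telemetry["scen_ms"] / telemetry["total_ms"],
    }


sample = {
    "weights": [0.21, 0.18, 0.14, 0.12, 0.11, 0.09, 0.08, 0.07],
    "universe": ["SPY", "QQQ", "NVDA", "MSFT", "AAPL", "META", "AMZN", "GOOGL"],
    "telemetry": {"scen_ms": 269, "solve_ms": 251, "total_ms": 521, "device": "cuda:0"},
}
summary = summarize_result(sample)
if not summary["on_gpu"]:
    print("warning: CPU fallback, expect a much slower solve")
print(summary["largest_position"], round(summary["scen_share"], 2))
```

Checking `on_gpu` on every response is a cheap way to catch a silent fallback during a long sweep.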
Modifying
- The engine wrapper is src/traderspace/optimizer/cufolio_engine.py.
- Constraints are set up in _build_problem(). Add a new constraint there.
- Scenario generation is in cuFOLIO upstream; fork and re-pip-install for custom KDE behavior.
- Solver tolerance and warm-start are surfaced through the engine constructor.
Verified benchmark
| config | scen gen | cuOpt solve | total |
|---|---|---|---|
| 5k scen, 8 sym, α=0.95 | 269 ms | 251 ms | ~520 ms |
| 10k scen, 50 sym, α=0.95 | ~580 ms | ~510 ms | ~1.1 s |
| 5k scen, 8 sym, CPU SciPy | 270 ms | ~15-50 s | ~15-50 s |
All numbers were measured on GB10 (compute capability 12.1) and are repeatable. The CPU fallback uses SciPy linprog for an apples-to-apples comparison; commercial CPLEX would be 5-10× faster than SciPy but would still not approach cuOpt PDLP throughput.
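The 30-100× figure follows directly from the table. The arithmetic below uses only the table's timings; the 100-rebalance walk-forward count is an assumed workload chosen for illustration, not a measured run.

```python
GPU_TOTAL_S = 0.520          # 5k scen, 8 sym, cuOpt PDLP (total)
GPU_SOLVE_S = 0.251          # same config, solve only
CPU_RANGE_S = (15.0, 50.0)   # 5k scen, 8 sym, CPU SciPy (low, high)


def speedup(cpu_s: float, gpu_s: float) -> float:
    """How many times faster the GPU path is for the same work."""
    return cpu_s / gpu_s


end_to_end = tuple(round(speedup(c, GPU_TOTAL_S), 1) for c in CPU_RANGE_S)
solve_only = tuple(round(speedup(c, GPU_SOLVE_S), 1) for c in CPU_RANGE_S)
print("end-to-end speedup:", end_to_end)   # (28.8, 96.2)
print("solve-only speedup:", solve_only)   # (59.8, 199.2)

# Assumed workload: a walk-forward backtest with 100 rebalance dates.
N_REBALANCES = 100
gpu_backtest_s = N_REBALANCES * GPU_TOTAL_S
cpu_backtest_min = tuple(N_REBALANCES * c / 60.0 for c in CPU_RANGE_S)
print(f"GPU: {gpu_backtest_s:.0f} s, CPU: {cpu_backtest_min[0]:.0f}-{cpu_backtest_min[1]:.0f} min")
```

End to end, the ratio lands at roughly 29-96×, which matches the quoted 30-100× range; the solve step alone shows a larger gap because scenario generation costs about the same in both rows.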