[ engine · gpu · cufolio ]

cuFOLIO · cuOpt PDLP

GPU portfolio optimization. Mean-CVaR linear program solved by NVIDIA cuOpt PDLP on top of KDE-generated scenarios. ~520 ms median solve on GB10 for an 8-symbol universe with 5,000 scenarios. The headline GPU number on the platform.

What it is · how it works · why it matters

[ what ]

GPU portfolio optimization. NVIDIA's Mean-CVaR blueprint, solved by cuOpt's PDLP solver on GB10 in ~520 ms.

[ how ]

5,000 Monte Carlo scenarios via KDE on GPU (~269 ms). LP minimizes tail loss at α=0.95 subject to long-only, max-position, leverage, and cash-floor constraints, with transaction costs penalized in the objective. cuOpt PDLP solves on GPU (~251 ms). Data never leaves device memory.

[ why ]

30-100× faster end to end than CPU SciPy (~520 ms vs ~15-50 s for 5k scenarios, 8 symbols). Unlocks walk-forward backtests in seconds, AutoResearch sweeps over hundreds of parameter combinations in one session, and rebalance proposals fast enough to be interactive. The headline GPU win.

Overview

cuFOLIO is NVIDIA's open-source portfolio optimization blueprint. NVTrader uses two pieces:

KDE scenario generator — 5,000+ Monte Carlo return paths on GPU.
cuOpt PDLP solver — primal-dual hybrid gradient LP solver, GPU-native.

Together they minimize the Conditional Value-at-Risk (CVaR) of a long-only equity portfolio subject to position caps, leverage limits, transaction costs, and a cash floor.
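To make the scenario step concrete, here is a minimal CPU sketch of KDE-based scenario generation. It uses SciPy's gaussian_kde as a stand-in and is not the cuFOLIO GPU generator; the function name kde_scenarios and the synthetic return history are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_scenarios(hist_returns, n_scenarios, seed=0):
    """Fit a joint Gaussian KDE to historical returns and sample scenarios.

    hist_returns has shape (n_assets, n_days); the result has shape
    (n_assets, n_scenarios). Hypothetical helper, CPU only.
    """
    kde = gaussian_kde(hist_returns)          # default bandwidth (Scott's rule)
    return kde.resample(n_scenarios, seed=seed)


# Synthetic daily returns for 3 assets over 500 days (illustrative data only).
rng = np.random.default_rng(42)
hist = rng.normal(0.0005, 0.01, size=(3, 500))

R = kde_scenarios(hist, 5000)
print(R.shape)  # (3, 5000)
```

Sampling from the joint density, rather than per asset, is what keeps cross-asset dependence in the scenarios.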

The optimization problem

minimize        VaR_α + 1/(1-α) · E[ max(0, -portfolio_return - VaR_α) ]
                + λ · turnover_cost
                + μ · vol_penalty

subject to      Σᵢ wᵢ = 1                              (fully invested)
                0 ≤ wᵢ ≤ max_position_pct              (long-only, cap)
                wcash ≥ cash_floor_pct                  (cash buffer)
                Σᵢ |wᵢ − w0ᵢ| ≤ turnover_budget        (smoothness)

Where:

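α is the CVaR confidence (default 0.95).
VaR_α is the Value-at-Risk threshold at confidence α.
w0 is the current portfolio.
R is the scenario matrix from KDE, shape (n_assets, n_scenarios).
portfolio_return is the return in scenario s, Σᵢ wᵢ · R[i, s].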
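The CVaR term becomes a linear program through the standard Rockafellar-Uryasev reformulation, in which one auxiliary variable per scenario replaces the max(0, ·). The sketch below solves that LP on CPU with SciPy linprog. It keeps only the fully-invested and position-cap constraints, omits the turnover, volatility, and cash-floor terms, and is not the cuOpt model the engine builds. The function name min_cvar_weights and the synthetic R are assumptions.

```python
import numpy as np
from scipy.optimize import linprog


def min_cvar_weights(R, alpha=0.95, max_position_pct=0.25):
    """Minimize scenario CVaR of the loss -R.T @ w (hypothetical helper).

    R has shape (n_assets, n_scenarios). Returns (weights, cvar).
    """
    n, S = R.shape
    # Variable layout: [w_0..w_{n-1}, t, u_0..u_{S-1}], with t the VaR threshold.
    c = np.concatenate([np.zeros(n), [1.0], np.full(S, 1.0 / ((1.0 - alpha) * S))])
    # u_s >= -R[:, s] @ w - t   rewritten as   -R.T w - t - u <= 0
    A_ub = np.hstack([-R.T, -np.ones((S, 1)), -np.eye(S)])
    b_ub = np.zeros(S)
    # Fully invested: the weights sum to 1.
    A_eq = np.concatenate([np.ones(n), np.zeros(1 + S)])[None, :]
    b_eq = [1.0]
    # Long-only with a cap on w, free t, non-negative u.
    bounds = [(0.0, max_position_pct)] * n + [(None, None)] + [(0.0, None)] * S
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n], res.fun


rng = np.random.default_rng(0)
R = rng.normal(0.0005, 0.01, size=(8, 400))   # synthetic scenarios, not KDE output
weights, cvar = min_cvar_weights(R)
print(weights.round(3), round(float(cvar), 4))
```

The dense identity block is fine at this size; at 5,000 scenarios a sparse constraint matrix would be the more practical choice.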

Knobs

knob	type	default	effect
`α`	float [0.90, 0.99]	0.95	Higher = more tail-averse. At 0.99, CVaR averages the worst 1% of scenarios.
`n_scenarios`	int [500, 50000]	5000	Cost is linear in n_scenarios. 5k is a sweet spot on GB10.
`max_position_pct`	float [0, 1]	0.25	Single-name cap. Lower → more diversification.
`cash_floor_pct`	float [0, 0.5]	0.02	Minimum cash. Useful for rebalance-cost robustness.
`risk_aversion (λ)`	float	2.0	Trade-off between expected return and CVaR.
`turnover_cost_bps`	float	5.0	Assumed round-trip transaction cost, in basis points.
`vol_penalty`	float	0.0	Penalty on realized volatility above the target; 0.0 disables it.
`solver_tolerance`	float	1e-6	PDLP convergence tolerance.
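Out-of-range knobs are easier to catch before a solve than after it. The following is a minimal sketch of a range check for the four knobs whose bounds are listed in the table. The CvarKnobs class is hypothetical and not part of the engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CvarKnobs:
    """Hypothetical container mirroring four rows of the knob table."""
    alpha: float = 0.95
    n_scenarios: int = 5000
    max_position_pct: float = 0.25
    cash_floor_pct: float = 0.02

    def __post_init__(self):
        # Ranges copied from the "type" column of the table.
        checks = {
            "alpha": 0.90 <= self.alpha <= 0.99,
            "n_scenarios": 500 <= self.n_scenarios <= 50000,
            "max_position_pct": 0.0 <= self.max_position_pct <= 1.0,
            "cash_floor_pct": 0.0 <= self.cash_floor_pct <= 0.5,
        }
        bad = [name for name, ok in checks.items() if not ok]
        if bad:
            raise ValueError(f"out-of-range knobs: {bad}")


knobs = CvarKnobs(alpha=0.99, n_scenarios=10000)
print(knobs)
```

Constructing CvarKnobs(alpha=0.5) raises ValueError, since 0.5 is below the documented lower bound of 0.90.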

Calling it

REST

curl -X POST http://127.0.0.1:8015/api/backtest/solve_once \
     -H 'content-type: application/json' \
     -d '{
       "universe": ["SPY","QQQ","NVDA","MSFT","AAPL","META","AMZN","GOOGL"],
       "alpha": 0.95,
       "n_scenarios": 5000,
       "max_position_pct": 0.25
     }'
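For scripts that cannot import the engine directly, the same request can be built with the Python standard library. This is a sketch that reuses the URL and payload fields from the curl call above; the helper names are hypothetical, and solve_once assumes the server replies with a JSON body.

```python
import json
import urllib.request

SOLVE_URL = "http://127.0.0.1:8015/api/backtest/solve_once"  # from the curl call


def build_solve_request(universe, alpha=0.95, n_scenarios=5000, max_position_pct=0.25):
    """Build the POST request without sending it (hypothetical helper)."""
    payload = {
        "universe": list(universe),
        "alpha": alpha,
        "n_scenarios": n_scenarios,
        "max_position_pct": max_position_pct,
    }
    return urllib.request.Request(
        SOLVE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def solve_once(universe, timeout_s=30.0, **knobs):
    """Send the request and decode the reply, assumed to be JSON."""
    req = build_solve_request(universe, **knobs)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)


req = build_solve_request(["SPY", "QQQ", "NVDA"])
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the server.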

Python

from traderspace.optimizer.cufolio_engine import CuFolioEngine
engine = CuFolioEngine()
result = engine.optimize_cvar(
    universe=["SPY","QQQ","NVDA","MSFT","AAPL","META","AMZN","GOOGL"],
    alpha=0.95,
    n_scenarios=5000,
    max_position_pct=0.25,
)
print(result["weights"])
print(result["telemetry"])  # scen_ms, solve_ms, total_ms

From the A2A bus

Publish a CritiqueClean or ScenarioRefreshed event; PortfolioOptimizationAgent picks it up and runs a solve. See A2A event bus docs.

Reading the result

{
  "weights": [0.21, 0.18, 0.14, 0.12, 0.11, 0.09, 0.08, 0.07],
  "universe": ["SPY","QQQ","NVDA",...],
  "expected_return": 0.187,    // annualized
  "expected_cvar": -0.043,     // tail loss at α
  "telemetry": {
    "scen_ms": 269,
    "solve_ms": 251,
    "total_ms": 521,
    "scenarios": 5000,
    "alpha": 0.95,
    "device": "cuda:0",
    "solver_iterations": 142
  }
}

If device is cpu, you're on CPU fallback (30-100× slower). Check CUDA driver, torch.cuda.is_available(), and the cuFOLIO install.
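A few of these checks can be automated before a result is trusted. The sketch below assumes the result dictionary has the shape shown above; check_result is a hypothetical helper, and the tolerance is an arbitrary choice.

```python
def check_result(result, tol=1e-6):
    """Return a list of problems; an empty list means the result looks sane."""
    problems = []
    weights = result["weights"]
    # Fully invested: the weights should sum to 1.
    if abs(sum(weights) - 1.0) > tol:
        problems.append(f"weights sum to {sum(weights):.6f}, expected 1")
    # Long-only: no weight should be negative.
    if min(weights) < -tol:
        problems.append("negative weight in a long-only portfolio")
    # Anything other than a CUDA device means the CPU fallback ran.
    device = result["telemetry"]["device"]
    if not device.startswith("cuda"):
        problems.append(f"device is {device!r}: running on CPU fallback")
    return problems


sample = {
    "weights": [0.21, 0.18, 0.14, 0.12, 0.11, 0.09, 0.08, 0.07],
    "telemetry": {"scen_ms": 269, "solve_ms": 251, "total_ms": 521, "device": "cuda:0"},
}
print(check_result(sample))  # []

cpu_sample = {**sample, "telemetry": {**sample["telemetry"], "device": "cpu"}}
print(check_result(cpu_sample))
```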

Modifying

The engine wrapper is src/traderspace/optimizer/cufolio_engine.py.
Constraints are set up in _build_problem(). Add a new constraint there.
Scenario generation is in cuFOLIO upstream — fork and re-pip-install for custom KDE behavior.
Solver tolerance and warm-start are exposed through the engine constructor.

Verified benchmark

config	scen gen	LP solve	total
5k scen, 8 sym, α=0.95	269 ms	251 ms	~520 ms
10k scen, 50 sym, α=0.95	~580 ms	~510 ms	~1.1 s
5k scen, 8 sym, CPU SciPy	270 ms	~15-50 s	~15-50 s

All numbers were measured on GB10 (compute capability 12.1) and are repeatable. The CPU fallback uses SciPy linprog for an apples-to-apples comparison; a commercial solver such as CPLEX would likely be 5-10× faster than SciPy but would still not approach cuOpt PDLP throughput.
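The speedup range follows directly from the table. As a worked check using only the 5k-scenario, 8-symbol rows:

```python
# Timings from the benchmark table (5k scenarios, 8 symbols).
gpu_total_s = 0.520          # scenario generation + cuOpt solve
gpu_solve_s = 0.251          # cuOpt solve only
cpu_range_s = (15.0, 50.0)   # SciPy linprog, low and high end

total_speedup = tuple(t / gpu_total_s for t in cpu_range_s)
solve_speedup = tuple(t / gpu_solve_s for t in cpu_range_s)

print(f"end to end: {total_speedup[0]:.0f}x to {total_speedup[1]:.0f}x")  # 29x to 96x
print(f"solve only: {solve_speedup[0]:.0f}x to {solve_speedup[1]:.0f}x")  # 60x to 199x
```

The 30-100× figure therefore matches the end-to-end ratio; on the solve step alone the gap is roughly twice as large, because scenario generation takes about the same time in both configurations.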