[ engine · gpu · cufolio ]

cuFOLIO · cuOpt PDLP

GPU portfolio optimization. Mean-CVaR linear program solved by NVIDIA cuOpt PDLP on top of KDE-generated scenarios. ~520 ms median solve on GB10 for an 8-symbol universe with 5,000 scenarios. The headline GPU number on the platform.

What it is · how it works · why it matters

[ what ]

GPU portfolio optimization. NVIDIA's Mean-CVaR blueprint, solved by cuOpt's PDLP solver on GB10 in ~520 ms.

[ how ]

5,000 Monte Carlo scenarios via KDE on GPU (~269 ms). The LP minimizes tail loss at α=0.95 subject to long-only, max-position, leverage, cash-floor, and transaction-cost constraints. cuOpt PDLP solves on GPU (~251 ms). Data never leaves device memory.

[ why ]

30-100× faster than CPU SciPy. Unlocks walk-forward backtests in seconds, AutoResearch sweeps over hundreds of parameter combinations in one session, and rebalance proposals fast enough to be interactive. The headline GPU win.

Overview

cuFOLIO is NVIDIA's open-source portfolio optimization blueprint. NVTrader uses two pieces:

  1. KDE scenario generator — 5,000+ Monte Carlo return paths on GPU.
  2. cuOpt PDLP solver — primal-dual hybrid gradient LP solver, GPU-native.

Together they minimize the Conditional Value-at-Risk (CVaR) of a long-only equity portfolio subject to position caps, leverage limits, transaction costs, and a cash floor.
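
A minimal CPU sketch of what a KDE scenario generator does conceptually: sampling from a Gaussian KDE amounts to picking a historical return row at random and adding kernel noise. This is not cuFOLIO's implementation. The function name kde_scenarios, the diagonal Scott's-rule bandwidth, and the toy history data are all assumptions made for illustration.

```python
import random
import statistics


def kde_scenarios(history, n_scenarios, seed=0):
    """Draw return scenarios from a Gaussian KDE fitted to `history`.

    history: list of rows, one row of per-asset returns per past period.
    Sampling a Gaussian KDE = pick a historical row uniformly at random,
    then add Gaussian noise scaled by a per-asset bandwidth.
    """
    rng = random.Random(seed)
    n, d = len(history), len(history[0])
    # Scott's rule factor for a d-dimensional Gaussian KDE: n^(-1/(d+4)).
    factor = n ** (-1.0 / (d + 4))
    # Diagonal bandwidth (assumption): factor * per-asset sample std dev.
    bandwidths = [
        factor * statistics.stdev([row[j] for row in history]) for j in range(d)
    ]
    scenarios = []
    for _ in range(n_scenarios):
        base = rng.choice(history)  # keeps cross-asset co-movement of that row
        scenarios.append([base[j] + rng.gauss(0.0, bandwidths[j]) for j in range(d)])
    return scenarios


# Toy two-asset history (hypothetical numbers, not market data).
history = [
    [0.010, 0.012],
    [-0.004, -0.007],
    [0.003, 0.001],
    [-0.015, -0.021],
    [0.008, 0.011],
    [0.001, -0.002],
]
scenarios = kde_scenarios(history, n_scenarios=5000, seed=42)
print(len(scenarios), len(scenarios[0]))
```

The resulting scenario matrix (one row per scenario, one column per asset) is the input the CVaR linear program consumes.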

The optimization problem

minimize        E[ max(0, -portfolio_return + VaR_α) ] · 1/(1-α)
                + λ · turnover_cost
                + μ · vol_penalty

subject to      Σᵢ wᵢ = 1                              (fully invested)
                0 ≤ wᵢ ≤ max_position_pct              (long-only, cap)
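                w_cash ≥ cash_floor_pct                 (cash buffer)
                Σᵢ |wᵢ − w0ᵢ| ≤ turnover_budget        (smoothness)

Where:

  wᵢ                target weight of asset i
  w0ᵢ               current (pre-rebalance) weight of asset i
  w_cash            weight held in cash
  portfolio_return  Σᵢ wᵢ · rᵢ, evaluated per scenario
  α                 CVaR confidence level
  VaR_α             Value-at-Risk at confidence level α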
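
To make the tail-loss term concrete, here is a small sketch of the standard Rockafellar-Uryasev formulation of CVaR evaluated on a finite scenario set. It uses a positive-loss convention (loss = minus portfolio return), which is an assumption; the engine's own sign convention may differ. The helper names portfolio_losses, ru_objective, and empirical_cvar are hypothetical, and the brute-force minimization over the threshold stands in for what an LP solver does with auxiliary variables.

```python
def portfolio_losses(weights, scenarios):
    """Loss per scenario = minus the weighted portfolio return."""
    return [-sum(w * r for w, r in zip(weights, row)) for row in scenarios]


def ru_objective(losses, t, alpha):
    """Rockafellar-Uryasev function: t + mean(max(0, loss - t)) / (1 - alpha)."""
    excess = sum(max(0.0, loss - t) for loss in losses) / len(losses)
    return t + excess / (1.0 - alpha)


def empirical_cvar(losses, alpha):
    """Minimise ru_objective over the threshold t.

    The function is convex and piecewise linear in t, so its minimum is
    attained at one of the sample losses; trying each one is enough.
    """
    return min(ru_objective(losses, t, alpha) for t in set(losses))


# 20 equally likely losses 1..20; at alpha = 0.90 the tail is the worst 2.
demo_losses = [float(x) for x in range(1, 21)]
cvar_90 = empirical_cvar(demo_losses, alpha=0.90)
print(round(cvar_90, 6))  # average of 19 and 20 -> 19.5

# One scenario, two assets at 50/50: return = 0.01 - 0.02, so loss = 0.01.
one_loss = portfolio_losses([0.5, 0.5], [[0.02, -0.04]])
```

In the LP form, each max(0, loss − t) becomes a non-negative auxiliary variable with one linear constraint per scenario, which is why solve cost grows with the scenario count.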

Knobs

knob                type                  default  effect
α                   float [0.90, 0.99]    0.95     Higher = more tail-averse. 0.99 weights the worst 1% of scenarios.
n_scenarios         int [500, 50000]      5000     Cost is linear in n_scenarios. 5k is a sweet spot on GB10.
max_position_pct    float [0, 1]          0.25     Single-name cap. Lower → more diversification.
cash_floor_pct      float [0, 0.5]        0.02     Minimum cash. Useful for rebalance-cost robustness.
risk_aversion (λ)   float                 2.0      Trade-off between expected return and CVaR.
turnover_cost_bps   float                 5.0      Round-trip transaction cost assumption.
vol_penalty         float                 0.0      Penalize realized vol > target.
solver_tolerance    float                 1e-6     PDLP convergence tolerance.
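
One way to keep knob values inside the documented ranges before sending them to the engine is a small validating container. This is a sketch only: the CvarKnobs class is hypothetical and not part of NVTrader or cuFOLIO; the field names, defaults, and ranges are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CvarKnobs:
    """Hypothetical container mirroring the knob table; not an NVTrader class."""

    alpha: float = 0.95
    n_scenarios: int = 5000
    max_position_pct: float = 0.25
    cash_floor_pct: float = 0.02
    risk_aversion: float = 2.0
    turnover_cost_bps: float = 5.0
    vol_penalty: float = 0.0
    solver_tolerance: float = 1e-6

    def __post_init__(self):
        # Only knobs with a documented range are checked.
        checks = {
            "alpha": 0.90 <= self.alpha <= 0.99,
            "n_scenarios": 500 <= self.n_scenarios <= 50000,
            "max_position_pct": 0.0 <= self.max_position_pct <= 1.0,
            "cash_floor_pct": 0.0 <= self.cash_floor_pct <= 0.5,
        }
        bad = [name for name, ok in checks.items() if not ok]
        if bad:
            raise ValueError(f"knobs outside documented range: {bad}")


knobs = CvarKnobs()
# The three knob keys that appear in the documented request examples.
request_fields = {
    "alpha": knobs.alpha,
    "n_scenarios": knobs.n_scenarios,
    "max_position_pct": knobs.max_position_pct,
}
```

Whether the API accepts the remaining knobs under these exact key names is not confirmed here, so only the three keys shown in the request examples are forwarded.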

Calling it

REST

curl -X POST http://127.0.0.1:8015/api/backtest/solve_once \
     -H 'content-type: application/json' \
     -d '{
       "universe": ["SPY","QQQ","NVDA","MSFT","AAPL","META","AMZN","GOOGL"],
       "alpha": 0.95,
       "n_scenarios": 5000,
       "max_position_pct": 0.25
     }'
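
The same request can be issued from Python using only the standard library. This is a sketch that mirrors the curl call above; the helper names build_request and solve_once and the 30-second timeout are assumptions, and the response is assumed to be a JSON body.

```python
import json
import urllib.request

URL = "http://127.0.0.1:8015/api/backtest/solve_once"

payload = {
    "universe": ["SPY", "QQQ", "NVDA", "MSFT", "AAPL", "META", "AMZN", "GOOGL"],
    "alpha": 0.95,
    "n_scenarios": 5000,
    "max_position_pct": 0.25,
}


def build_request(url, body):
    """Build the POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def solve_once(url=URL, body=payload, timeout=30.0):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(url, body), timeout=timeout) as resp:
        return json.load(resp)


request = build_request(URL, payload)
# result = solve_once()  # uncomment when the service is listening on port 8015
```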

Python

from traderspace.optimizer.cufolio_engine import CuFolioEngine
engine = CuFolioEngine()
result = engine.optimize_cvar(
    universe=["SPY","QQQ","NVDA","MSFT","AAPL","META","AMZN","GOOGL"],
    alpha=0.95,
    n_scenarios=5000,
    max_position_pct=0.25,
)
print(result["weights"])
print(result["telemetry"])  # scen_ms, solve_ms, total_ms

From the A2A bus

Publish a CritiqueClean or ScenarioRefreshed event; PortfolioOptimizationAgent picks it up and runs a solve. See A2A event bus docs.

Reading the result

{
  "weights": [0.21, 0.18, 0.14, 0.12, 0.11, 0.09, 0.08, 0.07],
  "universe": ["SPY","QQQ","NVDA",...],
  "expected_return": 0.187,    // annualized
  "expected_cvar": -0.043,     // tail loss at α
  "telemetry": {
    "scen_ms": 269,
    "solve_ms": 251,
    "total_ms": 521,
    "scenarios": 5000,
    "alpha": 0.95,
    "device": "cuda:0",
    "solver_iterations": 142
  }
}

If device is cpu, you're on CPU fallback (30-100× slower). Check CUDA driver, torch.cuda.is_available(), and the cuFOLIO install.
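
A short sketch of post-processing a result: pair symbols with weights, check that the weights sum to 1, and flag CPU fallback from the telemetry. The summarize helper is hypothetical. The sample dict reuses the numbers shown above, and the full universe list is taken from the request example on the assumption that the response echoes it in the same order.

```python
def summarize(result):
    """Return a compact view of an optimizer result dict."""
    weights = dict(zip(result["universe"], result["weights"]))
    telemetry = result["telemetry"]
    return {
        "weights": weights,
        "weight_sum": sum(result["weights"]),
        "largest": max(weights, key=weights.get),
        # Anything other than a cuda device is treated as CPU fallback.
        "on_gpu": telemetry["device"].startswith("cuda"),
        # Time not attributed to scenario generation or the LP solve.
        "overhead_ms": telemetry["total_ms"]
        - telemetry["scen_ms"]
        - telemetry["solve_ms"],
    }


sample = {
    "weights": [0.21, 0.18, 0.14, 0.12, 0.11, 0.09, 0.08, 0.07],
    # Assumed to match the request universe, in the same order.
    "universe": ["SPY", "QQQ", "NVDA", "MSFT", "AAPL", "META", "AMZN", "GOOGL"],
    "telemetry": {"scen_ms": 269, "solve_ms": 251, "total_ms": 521, "device": "cuda:0"},
}

summary = summarize(sample)
if not summary["on_gpu"]:
    print("CPU fallback detected: expect much slower solves")
print(summary["largest"], summary["overhead_ms"])
```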

Modifying

Verified benchmark

config                      scen gen   cuOpt solve   total
5k scen, 8 sym, α=0.95      269 ms     251 ms        ~520 ms
10k scen, 50 sym, α=0.95    ~580 ms    ~510 ms       ~1.1 s
5k scen, 8 sym, CPU SciPy   270 ms     ~15-50 s      ~15-50 s

All numbers are from GB10 (compute capability 12.1) and are repeatable. The CPU fallback uses SciPy linprog for an apples-to-apples comparison; in the CPU row, the solve column is the SciPy time. A commercial solver such as CPLEX would likely be 5-10× faster than SciPy but would still not approach cuOpt PDLP throughput.
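
The speedup range follows directly from the table totals, and a median over repeated runs is a reasonable way to reproduce the headline number. The sketch below shows both. The median_ms helper is hypothetical, and the timed callable is a stand-in; in real use it would wrap a call such as engine.optimize_cvar(...). It is assumed that the call blocks until the GPU work has finished, otherwise the timings would understate the cost.

```python
import statistics
import time


def median_ms(fn, repeats=5, warmup=1):
    """Median wall-clock time of fn() in milliseconds, after warm-up runs."""
    for _ in range(warmup):
        fn()  # warm-up: first calls often pay one-off initialization costs
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload so the sketch runs anywhere.
elapsed = median_ms(lambda: sum(range(10_000)))

# Speedup implied by the table: CPU total (15-50 s) over GPU total (~520 ms).
GPU_TOTAL_S = 0.520
CPU_TOTAL_S = (15.0, 50.0)
speedup = tuple(round(cpu / GPU_TOTAL_S) for cpu in CPU_TOTAL_S)
print(speedup)  # (29, 96), i.e. roughly the 30-100x quoted
```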

NVTrader v0.1.18 · docs · ⚠ Not financial advice