<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://senyangom.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://senyangom.github.io/" rel="alternate" type="text/html" /><updated>2026-04-27T00:48:14-04:00</updated><id>https://senyangom.github.io/feed.xml</id><title type="html">Sen Yang</title><subtitle>Ph.D., Operations Management, NYU Stern. Multi-LLM agent systems for quantitative research; online learning and stochastic optimization.</subtitle><author><name>Sen Yang</name><email>sy2576@stern.nyu.edu</email></author><entry><title type="html">AlphaBot: a multi-LLM agent system for systematic alpha discovery</title><link href="https://senyangom.github.io/posts/2026/04/alphabot-multi-llm-alpha-discovery/" rel="alternate" type="text/html" title="AlphaBot: a multi-LLM agent system for systematic alpha discovery" /><published>2026-04-19T00:00:00-04:00</published><updated>2026-04-19T00:00:00-04:00</updated><id>https://senyangom.github.io/posts/2026/04/alphabot-multi-llm-alpha-discovery</id><content type="html" xml:base="https://senyangom.github.io/posts/2026/04/alphabot-multi-llm-alpha-discovery/"><![CDATA[<blockquote>
  <p><strong>Status: draft, sanitized public version pending Cubist review.</strong></p>
</blockquote>

<h2 id="what-alphabot-is">What AlphaBot is</h2>

<p>AlphaBot is a multi-LLM agent system that does alpha research the way a quantitative team does: it composes validated patterns into candidate trading signals, backtests them under hedge-fund-grade discipline, and ships only the survivors. Three frontier LLM families propose; the harness disposes.</p>

<p>It runs in two markets: <strong>US equities at Cubist (Point72)</strong>, and an <strong>independent self-funded crypto deployment</strong> outside Cubist with an 18-month public live track.</p>

<h2 id="architecture-in-layers">Architecture, in layers</h2>

<p>AlphaBot’s architecture is layered, not stepped — each layer is a typed surface the next one composes on:</p>

<ul>
  <li><strong>Data.</strong> Market data — prices, volumes, fundamentals, microstructure features, signals from filings — ingested into a normalized research store with deterministic replay.</li>
  <li><strong>Centers of Expertise (CoE).</strong> A library of validated patterns and operators drawn from where signal-extraction has been studied seriously: statistics, physics, signal processing, electrical engineering, information theory, and adjacent domains. Each CoE is a compact declarative description of a regularity — a <em>thematic primitive</em> the language model can compose against. AlphaBot currently runs across <strong>150+ CoEs</strong>.</li>
  <li><strong>Skills.</strong> Composable research operations — backtesting under train/validate splits with overfitting controls, feature selection with statistical filters, ensemble prediction models across regression families, loss functions, and targets. Skills consume data and CoE-derived candidates, and produce structured research output.</li>
</ul>

<p>A research solution sits on top of the three layers: <em>trading-signal generation.</em> Three LLMs propose candidate factor expressions conditioned on (data, CoE, skill); the candidates flow through the harness; survivors get traded.</p>
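
<p>To make the CoE layer concrete, here is a minimal sketch of what a thematic primitive could look like as a typed record. The field names, the example identifier, and the stated regularity are illustrative assumptions, not AlphaBot’s actual schema or a claim about a live signal.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

@dataclass(frozen=True)
class CoEPrimitive:
    """Illustrative sketch of a Center-of-Expertise entry; field names are assumptions."""
    coe_id: str               # stable identifier, e.g. "information_theory/order_flow_entropy"
    domain: str               # source discipline the regularity is drawn from
    statement: str            # compact declarative description of the regularity
    operators: tuple          # named operators a proposal LLM may compose against
    data_requirements: tuple  # normalized data fields the primitive consumes

# A hypothetical entry a proposal LLM could condition on:
example_coe = CoEPrimitive(
    coe_id="information_theory/order_flow_entropy",
    domain="information theory",
    statement="Rolling entropy of signed order flow summarizes how concentrated recent trading pressure is.",
    operators=("rolling_entropy", "zscore", "rank"),
    data_requirements=("signed_volume", "mid_price"),
)
</code></pre></div></div>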

<h2 id="why-a-closed-loop-matters">Why a closed loop matters</h2>

<p>The discourse on “LLM-driven research” tends to optimize for the demo. In a real research environment, what matters is whether the system produces signals that survive an out-of-sample regime the model has never seen. That requires a harness, not a chatbot:</p>

<ul>
  <li>train / validate splits on different time periods,</li>
  <li>odd / even sample splits to catch overfit-to-regime,</li>
  <li>joint thresholds on train <em>and</em> validate Sharpe,</li>
  <li>a final out-of-sample holdout the brainstorm LLMs never saw.</li>
</ul>

<p>The language models propose. The harness disposes.</p>
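
<p>As a minimal sketch of what “the harness disposes” means in code, the gate below applies joint thresholds on train and validate Sharpe plus an odd/even consistency check. The threshold values and names are placeholders, not the production configuration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

@dataclass
class SplitStats:
    sharpe_train: float      # Sharpe over the training period
    sharpe_validate: float   # Sharpe over a later, disjoint validation period
    sharpe_odd: float        # Sharpe on odd-indexed samples
    sharpe_even: float       # Sharpe on even-indexed samples

def survives_harness(s: SplitStats,
                     min_train: float = 1.0,
                     min_validate: float = 0.8,
                     max_odd_even_gap: float = 0.5) -&gt; bool:
    """Joint thresholds on train AND validate Sharpe, plus an odd/even
    consistency check to catch overfit-to-regime; thresholds are placeholders."""
    if s.sharpe_train &lt; min_train or s.sharpe_validate &lt; min_validate:
        return False
    if abs(s.sharpe_odd - s.sharpe_even) &gt; max_odd_even_gap:
        return False
    return True
</code></pre></div></div>

<p>Candidates that pass a gate of this shape are then scored on the final out-of-sample holdout the brainstorming LLMs never saw.</p>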

<h2 id="multi-llm-proposal-an-empirical-ranking">Multi-LLM proposal: an empirical ranking</h2>

<p>Three frontier model families propose in parallel: Claude Opus, GPT-5, Gemini 2.5 Pro. Across many alpha rounds against the same harness, three findings are stable:</p>

<ul>
  <li><strong>Research-output quality.</strong> On this task, <strong>Opus &gt; GPT &gt; Gemini</strong>.</li>
  <li><strong>Orthogonality.</strong> The two strongest models propose substantially complementary candidates rather than redundant ones — the union of their proposals is materially wider than either alone.</li>
  <li><strong>Cost-effectiveness.</strong> On a yield-per-dollar basis, Opus is the most cost-effective of the three for research-grade generation.</li>
</ul>

<p>This is why AlphaBot runs three model families side-by-side: the harness, not the LLM, is the bar that decides what survives, so a wider candidate surface is better at the proposal step.</p>

<h2 id="scale-to-date">Scale to date</h2>

<p>AlphaBot has brainstormed <strong>over 100,000 candidate signals with reasoning</strong> across 150+ CoEs and three frontier LLM families, and has produced <strong>400+ production-level signals</strong> in the equities deployment, spanning momentum, mean reversion, liquidity, information flow, and other risk-factor categories. Ensemble prediction models combining surviving signals attain promising Sharpe on both large- and small-cap US equities. (Specifics IP-protected; qualitative summary only.)</p>

<h2 id="the-cross-market-check">The cross-market check</h2>

<p>The same architecture runs on a self-funded crypto research project outside Cubist (joint with Beier Liu). <a href="https://dash.300k.xyz/group/300kinvestorshowcaseaccounts?period=30d">Live performance is public</a> over an 18-month track on a self-funded testing account.</p>

<p>Two markets, same scaffolding — that’s the point: when the same agent architecture produces surviving signals in markets with very different microstructure (US equities vs mid-frequency crypto), the architecture is doing real work, not overfitting to a single regime.</p>

<hr />

<p><em>Sanitized public version forthcoming.</em></p>]]></content><author><name>Sen Yang</name><email>sy2576@stern.nyu.edu</email></author><category term="llm-agents" /><category term="quantitative-finance" /><category term="alpha-research" /><category term="backtest" /><summary type="html"><![CDATA[Status: draft, sanitized public version pending Cubist review.]]></summary></entry><entry><title type="html">IvorySquare: peer-reviewed methodology as a tool surface for LLM agents</title><link href="https://senyangom.github.io/posts/2026/04/ivorysquare-architecture/" rel="alternate" type="text/html" title="IvorySquare: peer-reviewed methodology as a tool surface for LLM agents" /><published>2026-04-19T00:00:00-04:00</published><updated>2026-04-19T00:00:00-04:00</updated><id>https://senyangom.github.io/posts/2026/04/ivorysquare-architecture</id><content type="html" xml:base="https://senyangom.github.io/posts/2026/04/ivorysquare-architecture/"><![CDATA[<p><em>IvorySquare — an open Ivory Tower for everyone.</em></p>

<h2 id="the-premise">The premise</h2>

<p>Most “AI for finance” tools are wrappers around chat, with a human as the user of the language model. IvorySquare inverts that premise: the language model is the user, and the system exposes verifiable, citation-grounded methodology as the tool surface it consumes. Finance, accounting, economics, and operations research have produced decades of peer-reviewed methods — each with a formula, a data requirement, an empirical validation, and a citation. IvorySquare turns that body of work into typed, test-backed, provenance-tracked skills the agent calls against.</p>

<p>The architecture is a stack of layers, each with its own contract: a data and standardization stack, a foundational curriculum layer at textbook-subsection granularity, a paper-derived skill layer, four LLM persona configurations that gate every artifact, and an evaluation harness that decides what ships.</p>

<h2 id="layered-data-stack-l0--l5">Layered data stack (L0 – L5)</h2>

<p>Ingestion of corporate disclosures (SEC filings, XBRL / iXBRL), market data, and related sources flows through APIs and MCP connectors into a typed store with deterministic replay and content-hash versioning at every layer:</p>

<ul>
  <li><strong>L0 raw</strong> — original payloads, captured byte-for-byte with their fetch URL, retrieval timestamp, and content hash.</li>
  <li><strong>L1 parsed</strong> — structured representations (XBRL facts, parsed PDF text, normalized JSON) with original-source line-item provenance preserved.</li>
  <li><strong>L2 standardized</strong> — canonical accounting concepts and market-data fields under the standardization layer’s vocabulary, with unit and period normalization.</li>
  <li><strong>L3 derived</strong> — fundamental-extraction and ratio computations performed via the foundational skill layer.</li>
  <li><strong>L4 modeled</strong> — paper-derived methodology applied (Beneish M-Score, Altman Z-Score, readability/complexity factors, and so on).</li>
  <li><strong>L5 delivery</strong> — composed reports, agent-callable JSON, and downstream-consumer surfaces.</li>
</ul>

<p>Every L_n artifact carries a citation chain back to its L0 source, so any consumer of an L5 deliverable can trace a numeric output to a specific filing line item with its hash, retrieval timestamp, and source URL.</p>
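
<p>A minimal sketch of the capture-and-provenance idea, assuming hypothetical record shapes rather than IvorySquare’s actual types: an L0 record stores the payload byte-for-byte with its fetch metadata and content hash, and every derived value carries links of this kind back down the stack.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class L0Artifact:
    """Sketch of an L0 raw record: byte-for-byte payload plus capture metadata."""
    fetch_url: str
    retrieved_at: str   # ISO-8601 retrieval timestamp
    payload: bytes
    content_hash: str

    @staticmethod
    def capture(fetch_url: str, retrieved_at: str, payload: bytes) -&gt; "L0Artifact":
        return L0Artifact(fetch_url, retrieved_at, payload,
                          hashlib.sha256(payload).hexdigest())

@dataclass(frozen=True)
class CitationLink:
    """One hop in the chain an L_n value carries back toward L0; field names are assumptions."""
    layer: str            # "L1" through "L5"
    artifact_hash: str    # content hash of the upstream artifact
    source_locator: str   # e.g. an XBRL fact id or a filing line-item reference
</code></pre></div></div>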

<h2 id="standardization-layer">Standardization layer</h2>

<p>Between L1 and L2 sits the standardization layer: a typed mapping from filer-specific tags (XBRL element names, vendor-specific market-data fields, paper-specific worked-example labels) to IvorySquare’s canonical vocabulary. Standardization is reversible — every canonical fact records the source tag, the mapping rule, and the rule’s version, so a later schema change replays cleanly without losing the prior ground truth. Mappings are reviewed by the <code class="language-plaintext highlighter-rouge">accounting_expert</code> persona before promotion.</p>
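
<p>A sketch of what a reversible mapping record might capture, under assumed field names; the point is that the canonical fact keeps enough information to replay or invert the mapping later.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

@dataclass(frozen=True)
class MappingRule:
    """Illustrative standardization rule; field names are assumptions."""
    source_tag: str       # filer- or vendor-specific tag, e.g. an XBRL element name
    canonical_name: str   # canonical vocabulary concept
    rule_id: str
    rule_version: str     # a later schema change bumps the version; old facts keep theirs

def standardize(value, rule: MappingRule) -&gt; dict:
    """Produce an L2 canonical fact that still records how it was produced,
    so replays after a schema change do not lose the prior ground truth."""
    return {
        "canonical_name": rule.canonical_name,
        "value": value,
        "source_tag": rule.source_tag,
        "rule_id": rule.rule_id,
        "rule_version": rule.rule_version,
    }
</code></pre></div></div>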

<h2 id="foundational-curriculum-layer">Foundational curriculum layer</h2>

<p>The skill graph is anchored at the bottom by a foundational layer of textbook-subsection-granular concept skills. Two branches populate the layer:</p>

<ul>
  <li><strong>Finance/accounting:</strong> CFA Level 1 — FSA, Equity, Corporate Issuers — and a CPA FAR review outline.</li>
  <li><strong>Operations research:</strong> Bertsimas-Tsitsiklis <em>Introduction to Linear Optimization</em>, Boyd-Vandenberghe <em>Convex Optimization</em>, Ross <em>Stochastic Processes</em>, Ross <em>A First Course in Probability</em>.</li>
</ul>

<p>Eight textbook TOCs are ingested into a YAML-backed prerequisite DAG with ninety-five candidate subsection nodes; node identifiers preserve the <code class="language-plaintext highlighter-rouge">&lt;branch&gt;/&lt;book_id&gt;/chXX__YY__sub</code> shape so prerequisite chains remain readable in both the curriculum audit log and the materialized skill tree. The graph is queryable through the <code class="language-plaintext highlighter-rouge">mvp curriculum</code> CLI surface (<code class="language-plaintext highlighter-rouge">ingest</code>, <code class="language-plaintext highlighter-rouge">filter</code>, <code class="language-plaintext highlighter-rouge">materialize</code>, <code class="language-plaintext highlighter-rouge">graph</code>) and renders to Graphviz DOT or SVG.</p>
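
<p>A small sketch of the prerequisite DAG in code, with hypothetical node identifiers that follow the shape above; the traversal simply orders nodes so prerequisites materialize first.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import defaultdict, deque

# Hypothetical nodes following the &lt;branch&gt;/&lt;book_id&gt;/chXX__YY__sub shape.
prereqs = {
    "or/linear_optimization/ch02__01__standard_form": [],
    "or/linear_optimization/ch03__01__simplex_pivot": [
        "or/linear_optimization/ch02__01__standard_form",
    ],
}

def materialization_order(prereqs: dict) -&gt; list:
    """Topological order over the prerequisite DAG (Kahn's algorithm), so a
    subsection never materializes before the subsections it depends on."""
    indegree = {node: 0 for node in prereqs}
    dependents = defaultdict(list)
    for node, deps in prereqs.items():
        for dep in deps:
            dependents[dep].append(node)
            indegree[node] += 1
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in dependents[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order
</code></pre></div></div>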

<h3 id="two-dimensional-bare-llm-filter">Two-dimensional bare-LLM filter</h3>

<p>Each candidate node’s question bank runs N=10 trials per question through <code class="language-plaintext highlighter-rouge">claude-haiku-4-5</code> and records both the pass rate and a failure-mode taxonomy: <code class="language-plaintext highlighter-rouge">qualitative_correct</code>, <code class="language-plaintext highlighter-rouge">computational_off_by_arithmetic</code>, <code class="language-plaintext highlighter-rouge">structural_misunderstanding</code>, <code class="language-plaintext highlighter-rouge">unit_or_dimension_error</code>, <code class="language-plaintext highlighter-rouge">partial_correct</code>. Surviving nodes materialize under one of three reasons:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">closed_form_determinism</code>.</strong> The subsection involves closed-form numerical calculation (Black-Scholes pricing, simplex pivots, NPV / IRR, ratio analyses, KKT residual checks). These materialize regardless of pass rate — an 88% pass rate on a deterministic computation still leaves 12% of calls silently wrong, and downstream consumers would treat those answers as authoritative. Every <code class="language-plaintext highlighter-rouge">closed_form_determinism</code> skill ships a <code class="language-plaintext highlighter-rouge">code/</code> reference implementation plus a green <code class="language-plaintext highlighter-rouge">pytest</code> unit-test file.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">conceptual_high_value</code>.</strong> The pass rate sits in the [0.85, 0.95] band on conceptual content, so the markdown surface alone adds value; the skill ships markdown-only.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">llm_fails</code>.</strong> Pass rate is below 0.85; code backing is required to make the subsection deterministic.</li>
</ul>

<p>Subsections are dropped when pass rate exceeds 0.95 with only benign failures.</p>
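
<p>The filter is easy to state as code. The sketch below encodes the thresholds from the prose (0.85 and 0.95); the function itself is an illustration, not repository code, and the fallback for a high pass rate with non-benign failures is an assumption the prose does not pin down.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def materialization_decision(pass_rate: float,
                             is_closed_form: bool,
                             only_benign_failures: bool) -&gt; str:
    """Illustrative encoding of the two-dimensional bare-LLM filter."""
    if is_closed_form:
        # Closed-form numerical subsections materialize regardless of pass rate:
        # even a high pass rate leaves silently wrong calls on a deterministic task.
        return "closed_form_determinism"
    if pass_rate &gt; 0.95 and only_benign_failures:
        return "drop"
    if pass_rate &lt; 0.85:
        return "llm_fails"               # code backing required
    if pass_rate &lt;= 0.95:
        return "conceptual_high_value"   # markdown-only skill
    # Pass rate above 0.95 with non-benign failures: keeping it conceptual is an assumption.
    return "conceptual_high_value"
</code></pre></div></div>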

<p>The first materialized batch covers twelve subsection skills across both branches — nine <code class="language-plaintext highlighter-rouge">closed_form_determinism</code> skills (each shipping a <code class="language-plaintext highlighter-rouge">code/</code> reference plus unit-test file under <code class="language-plaintext highlighter-rouge">mvp/tests/</code>), two <code class="language-plaintext highlighter-rouge">conceptual_high_value</code> markdown-only skills, and one <code class="language-plaintext highlighter-rouge">llm_fails</code> code-backed skill. Every manifest validates against the strict <code class="language-plaintext highlighter-rouge">SkillManifest</code> schema, every code reference is exercised by a green <code class="language-plaintext highlighter-rouge">pytest</code> suite, and every node carries the bare-LLM pass-rate snapshot plus failure-mode taxonomy alongside the curated <code class="language-plaintext highlighter-rouge">concept.md</code>. Verbatim textbook content is excluded by policy: only TOC structure, IvorySquare-authored paraphrases, and IvorySquare-original worked examples ship in the repository.</p>

<h3 id="foundational-skill-layout">Foundational skill layout</h3>

<p>Per surviving subsection:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvp/skills/foundational/&lt;branch&gt;/&lt;book_id&gt;/&lt;chapter&gt;__&lt;section&gt;__&lt;subsection&gt;/
├── concept.md              # definition, intuition, examples, prereq links
├── prereqs.yaml            # explicit prerequisite skill_ids (drives the DAG)
├── eval/
│   ├── question_bank.yaml  # 10–25 textbook-style questions with expected answers
│   └── llm_baseline.json   # bare-LLM pass-rate snapshot + failure-mode taxonomy
├── code/                   # reference implementation when materialization_reason
│   └── ...                 #   is closed_form_determinism or llm_fails
├── manifest.yaml           # standard IvorySquare manifest; layer: foundational
└── README.md               # public-facing summary
</code></pre></div></div>

<h2 id="paper-derived-skill-layer">Paper-derived skill layer</h2>

<p>Above the curriculum layer, a paper-derived skill layer carries citation-grounded implementations from peer-reviewed papers. Each paper-derived skill ships the formula and its paper reference, a unit-test harness that reproduces the paper’s worked examples within a ±0.05 tolerance, a provenance trace from numeric output down to the source filing line item, and a declarative interface callable from either an MCP server or an OpenAI tool specification.</p>
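
<p>A sketch of the replication contract as a test, under wholly hypothetical names: the import path, the inputs, and the expected value stand in for a real paper’s worked example, and the ±0.05 tolerance is the only number taken from the text above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pytest

# Placeholder skill module and worked-example values, for illustration only.
from mvp.skills.example_paper_skill import compute_score  # hypothetical import path

WORKED_EXAMPLE_INPUTS = {"line_item_a": 120.0, "line_item_b": 95.0}  # placeholder inputs
PAPER_REPORTED_OUTPUT = 1.26                                         # placeholder expected value

def test_replicates_paper_worked_example():
    result = compute_score(**WORKED_EXAMPLE_INPUTS)
    # Replication contract: match the paper's reported output within the ±0.05 tolerance.
    assert result.value == pytest.approx(PAPER_REPORTED_OUTPUT, abs=0.05)
    # Citation contract: the numeric output must trace back to source line items.
    assert result.citations, "output must carry a citation chain"
</code></pre></div></div>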

<h3 id="deep-paper-to-skill-pipeline">Deep paper-to-skill pipeline</h3>

<p>The paper-derived layer is produced by a six-stage LLM-orchestrated pipeline that targets ≈5M tokens of deliberate spend per paper across well-defined stages with explicit per-stage budgets:</p>

<table>
  <thead>
    <tr>
      <th>Stage</th>
      <th>Description</th>
      <th>Persona</th>
      <th style="text-align: right">Target tokens</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>A1. Extract</td>
      <td>PDF → structured JSON: TOC, equations, tables, sample characteristics, worked examples, threshold values</td>
      <td>deterministic + <code class="language-plaintext highlighter-rouge">quant_finance_methodologist</code> review</td>
      <td style="text-align: right">~500K</td>
    </tr>
    <tr>
      <td>A2. Digest</td>
      <td>Long-form digest covering intuition, paper-exact formulas, worked-example reproductions, edge cases, sample-period assumptions, prerequisites</td>
      <td><code class="language-plaintext highlighter-rouge">quant_finance_methodologist</code> (primary) + <code class="language-plaintext highlighter-rouge">accounting_expert</code> (audit)</td>
      <td style="text-align: right">~1M</td>
    </tr>
    <tr>
      <td>A3. Implementation</td>
      <td><code class="language-plaintext highlighter-rouge">skill.py</code> + <code class="language-plaintext highlighter-rouge">manifest.yaml</code> + (if applicable) <code class="language-plaintext highlighter-rouge">rules/templates/&lt;skill_id&gt;.yaml</code>; iterate to compile + import-clean</td>
      <td><code class="language-plaintext highlighter-rouge">quant_finance_methodologist</code> author → <code class="language-plaintext highlighter-rouge">accounting_expert</code> rule template</td>
      <td style="text-align: right">~1M</td>
    </tr>
    <tr>
      <td>A4. Unit-test authoring</td>
      <td>25–40 unit tests covering paper’s worked examples, null propagation, missing-line-item handling, edge inputs, citation contract</td>
      <td><code class="language-plaintext highlighter-rouge">evaluation_agent</code></td>
      <td style="text-align: right">~500K</td>
    </tr>
    <tr>
      <td>A5. Replication harness</td>
      <td>Skill runs against paper’s reported worked-example outputs; iterate until tolerance is met or deviation is documented in <code class="language-plaintext highlighter-rouge">implementation_decisions[]</code></td>
      <td><code class="language-plaintext highlighter-rouge">quant_finance_methodologist</code></td>
      <td style="text-align: right">~500K</td>
    </tr>
    <tr>
      <td>A6. Verification + persona review</td>
      <td><code class="language-plaintext highlighter-rouge">citation_auditor</code> resolves every citation; <code class="language-plaintext highlighter-rouge">accounting_expert</code> audits rule template; <code class="language-plaintext highlighter-rouge">evaluation_agent</code> authors gold cases under <code class="language-plaintext highlighter-rouge">eval/gold/&lt;skill&gt;/</code></td>
      <td>all four personas</td>
      <td style="text-align: right">~1.5M</td>
    </tr>
  </tbody>
</table>

<h3 id="cost-tracking-instrumentation">Cost-tracking instrumentation</h3>

<p>Each stage is wrapped in a cost-tracking context manager that records per-call token counts — <code class="language-plaintext highlighter-rouge">(stage_id, persona, input_tokens, output_tokens, cache_read, cache_creation)</code> — to <code class="language-plaintext highlighter-rouge">mvp/agents/cost_log/&lt;run_id&gt;.jsonl</code>. Aggregation surfaces per-stage, per-persona, and per-model totals through <code class="language-plaintext highlighter-rouge">mvp.lib.cost_tracking.summarize</code> and the <code class="language-plaintext highlighter-rouge">mvp skills cost &lt;skill_id&gt;</code> CLI subcommand. The manifest field <code class="language-plaintext highlighter-rouge">cost_observed_last_n_runs</code> populates from this log, so a downstream consumer can see exactly how much the skill cost to author and at what stage the spend concentrated.</p>
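
<p>A minimal sketch of what the per-stage wrapper could look like; the record shape follows the fields listed above, but the function name and callback mechanics are assumptions, not the repository’s <code class="language-plaintext highlighter-rouge">mvp.lib.cost_tracking</code> implementation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def track_stage_cost(run_id: str, stage_id: str, persona: str,
                     log_dir: str = "mvp/agents/cost_log"):
    """Yield a callback the wrapped stage calls once per LLM request; flush
    one JSON line per call to the per-run cost log on exit."""
    records = []
    try:
        yield lambda usage: records.append({
            "stage_id": stage_id,
            "persona": persona,
            "input_tokens": usage.get("input_tokens", 0),
            "output_tokens": usage.get("output_tokens", 0),
            "cache_read": usage.get("cache_read", 0),
            "cache_creation": usage.get("cache_creation", 0),
        })
    finally:
        log_path = Path(log_dir) / f"{run_id}.jsonl"
        log_path.parent.mkdir(parents=True, exist_ok=True)
        with log_path.open("a") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")
</code></pre></div></div>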

<h3 id="persona-gates">Persona gates</h3>

<p>Stages are gated by structured persona verdicts — <code class="language-plaintext highlighter-rouge">go</code> / <code class="language-plaintext highlighter-rouge">revise</code> / <code class="language-plaintext highlighter-rouge">block</code> — written into a per-run audit-log directory at <code class="language-plaintext highlighter-rouge">mvp/agents/audit_log/&lt;run_id&gt;/</code>. A stage cannot hand off to the next until its gate persona signs off; a <code class="language-plaintext highlighter-rouge">block</code> halts the run with a typed <code class="language-plaintext highlighter-rouge">revisions_needed[]</code> block the caller can act on. Stages can revise upstream artifacts; iteration cost stays inside the budget.</p>
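
<p>The gate itself reduces to a small amount of code. The sketch below uses assumed field names and file naming; only the verdict vocabulary and the audit-log location come from the description above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class GateVerdict:
    """Illustrative shape of a structured persona verdict."""
    persona: str
    stage_id: str
    verdict: str                                  # "go" | "revise" | "block"
    revisions_needed: list = field(default_factory=list)

def record_and_check(v: GateVerdict, run_id: str,
                     audit_root: str = "mvp/agents/audit_log") -&gt; bool:
    """Write the verdict into the per-run audit-log directory and report whether
    the next stage may proceed; a 'block' halts the run with the typed revisions."""
    out_dir = Path(audit_root) / run_id
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{v.stage_id}__{v.persona}.json"
    out_file.write_text(json.dumps(asdict(v), indent=2))
    if v.verdict == "block":
        raise RuntimeError(f"{v.persona} blocked {v.stage_id}: {v.revisions_needed}")
    return v.verdict == "go"
</code></pre></div></div>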

<h3 id="calibration-vs-fresh-modes">Calibration vs fresh modes</h3>

<p>The orchestrator runs in two modes:</p>

<ul>
  <li><strong>Calibration mode</strong> re-processes an already-onboarded paper and emits a structured delta against the shipped artifacts — token spend per stage, per-persona token attribution, replication-tolerance comparison, citation-contract diff. Calibration runs validate that the deep-pipeline output matches existing skills within tolerance.</li>
  <li><strong>Fresh mode</strong> produces a new paper-derived skill ready for promotion into the registry, fully under the deep pipeline without manual intervention beyond the persona-block override mechanism.</li>
</ul>

<h3 id="acceptance-gates">Acceptance gates</h3>

<p>Codified in <code class="language-plaintext highlighter-rouge">success_criteria.md</code> §14:</p>

<ul>
  <li>Per-stage token spend within ±20% of target.</li>
  <li>Paper-replication tolerance met on every worked example, or a documented <code class="language-plaintext highlighter-rouge">implementation_decisions[]</code> entry justifying the deviation.</li>
  <li>Citation contract intact under the <code class="language-plaintext highlighter-rouge">citation_auditor</code>’s review (every citation resolves to a line-item source).</li>
  <li>Gold cases authored for every worked example by the <code class="language-plaintext highlighter-rouge">evaluation_agent</code>.</li>
</ul>
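
<p>The first of these gates is mechanical enough to sketch directly; stage names and the report shape below are illustrative, and the ±20% tolerance and per-stage targets come from the criteria and the pipeline table above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def within_spend_budget(observed_tokens: dict, target_tokens: dict,
                        tolerance: float = 0.20) -&gt; dict:
    """Per-stage check that observed token spend sits within ±tolerance of target."""
    report = {}
    for stage, target in target_tokens.items():
        observed = observed_tokens.get(stage, 0)
        lo, hi = target * (1 - tolerance), target * (1 + tolerance)
        report[stage] = lo &lt;= observed &lt;= hi
    return report

# Hypothetical usage against the per-stage targets from the pipeline table:
targets = {"A1": 500_000, "A2": 1_000_000, "A3": 1_000_000,
           "A4": 500_000, "A5": 500_000, "A6": 1_500_000}
</code></pre></div></div>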

<h2 id="four-llm-persona-configurations">Four LLM persona configurations</h2>

<p>Four LLM persona configurations, defined in declarative YAML under <code class="language-plaintext highlighter-rouge">mvp/human_layer/personas/</code>, carry the contracts they fulfil at every pipeline stage:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">accounting_expert</code></strong> — audits rule templates, reviews standardization mappings, gates A2 digest and A3 implementation for accounting-driven papers.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">quant_finance_methodologist</code></strong> — authors A2 digests, drafts A3 implementations, gates A1 extraction and A5 replication.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">evaluation_agent</code></strong> — authors A4 unit tests and A6 gold cases, gates the harness contract.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">citation_auditor</code></strong> — resolves every citation in A6 verification at the line-item level.</li>
</ul>

<p>The human-layer surface stays disjoint from the engineering layer: persona prompts compose at runtime against typed <code class="language-plaintext highlighter-rouge">input_contract_description</code> cases per stage, and the same four personas service both the foundational layer (concept-page authoring and audit) and the paper-derived layer (paper-to-skill pipeline).</p>

<h2 id="mcp--openai-tool-surfaces">MCP + OpenAI tool surfaces</h2>

<p>Every skill — foundational and paper-derived — exposes a declarative interface callable from either an MCP server or an OpenAI tool specification. One library, two surfaces, no translation glue. The MCP server publishes skills as MCP tools (<code class="language-plaintext highlighter-rouge">tools/list</code>, <code class="language-plaintext highlighter-rouge">tools/call</code>, with structured-content output and resource references for citations); the OpenAI tool spec publishes the same set as <code class="language-plaintext highlighter-rouge">function</code> tools with JSON-Schema-conformant arguments. Switching between Anthropic agents (Claude Opus / Sonnet / Haiku via the Anthropic SDK) and OpenAI agents (GPT-5 with reasoning tokens) requires no skill-side change.</p>
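
<p>A sketch of the “one library, two surfaces” idea, assuming a manifest dict with <code class="language-plaintext highlighter-rouge">skill_id</code>, <code class="language-plaintext highlighter-rouge">description</code>, and a JSON-Schema <code class="language-plaintext highlighter-rouge">input_schema</code> field; the actual manifest layout and helper names are assumptions, not the repository’s API.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def skill_manifest_to_openai_tool(manifest: dict) -&gt; dict:
    """Render a skill as an OpenAI 'function' tool with JSON-Schema arguments."""
    return {
        "type": "function",
        "function": {
            "name": manifest["skill_id"],
            "description": manifest["description"],
            "parameters": manifest["input_schema"],   # already JSON Schema
        },
    }

def skill_manifest_to_mcp_tool(manifest: dict) -&gt; dict:
    """Render the same skill as an MCP tool entry for tools/list."""
    return {
        "name": manifest["skill_id"],
        "description": manifest["description"],
        "inputSchema": manifest["input_schema"],
    }
</code></pre></div></div>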

<h2 id="evaluation-harness">Evaluation harness</h2>

<p>A first-class design concern, not an afterthought. Every solution is gated by an evaluation harness tuned to the review bar of the domain it serves — rubric-driven scoring for accounting interpretation, overfitting-aware statistical validation for quantitative research. The harness, not the language model, decides what ships.</p>

<h2 id="research-direction">Research direction</h2>

<p>Skills are paper-derived and citation-audited; a <em>library</em> of skills is more than a collection — it is a graph whose topology mirrors the citation and conceptual structure of the underlying literature. That graph is, in principle, a structured post-training substrate for tool-using LLMs: each skill supplies both a tool-use trace and a verifiable ground-truth signal, the foundational curriculum layer supplies the prerequisite ordering, and the paper-derived layer supplies the citation topology — together, a natural curriculum from primitive methods to composite ones.</p>

<p>This is the research direction that motivates the framework.</p>

<hr />

<p><em>Repo: <a href="https://github.com/SenYangOM/IvorySquareSolutions">github.com/SenYangOM/IvorySquareSolutions</a>. Comments / questions: sy2576 [at] stern.nyu.edu.</em></p>]]></content><author><name>Sen Yang</name><email>sy2576@stern.nyu.edu</email></author><category term="llm-agents" /><category term="architecture" /><category term="mcp" /><category term="tool-use" /><summary type="html"><![CDATA[IvorySquare — an open Ivory Tower for everyone.]]></summary></entry></feed>