Backtesting Solana Trading Strategies with Historical Raw Blocks (2026)
How to backtest Solana DEX arb, token sniping, liquidation, and LP strategies using historical raw blocks and parsed trading datasets. Python, DuckDB, and Rust.
The first time we pointed a script at a month of raw blocks and ran a row count, we got 2.3 billion transactions. That was our first result and it was wrong. We went back, re-ran the count, got the same number. Started investigating. About 78% of those transactions were vote transactions: validators confirming slots, not trades. Filter those out and you're at roughly 500 million rows. That's the first thing that breaks a new backtesting setup, and it's documented right on our product page because we've seen teams spend two days debugging it before asking us.
That's one problem. The data quality problem. The infrastructure problem is different, and runs deeper.
RPC nodes cannot replay historical state. You can call getBlock and get the transaction list for any slot. You cannot call getAccountInfo for slot 240,000,000 and get what that account held at that moment. The node will return current state, every time. If your strategy depends on knowing what a lending position's health factor was three months ago, or what reserves a pool held before a large swap, RPC cannot give you that. The block archive is the only source.
CEX price feeds make this worse in a way that's easy to miss. OHLCV aggregates across order books on centralized venues. Solana DEX prices are per-swap, per-slot, including failed transactions that consume block space but don't move the price. A strategy built on Binance 1-minute candles and tested on Solana will have wrong fill prices, wrong timing, and no visibility into on-chain competition for the same order. The strategies look fine on paper. They behave differently live. That gap almost always traces back to the same handful of data errors, not the strategy logic itself.
01 What's Wrong With the Data You're Probably Using
We've debugged enough broken backtests to know which data errors are common. Most are silent: no error thrown, just a wrong number that looks plausible.
CEX OHLCV vs on-chain swaps. Binance 1-minute candles aggregate across a central order book. They do not represent what actually happened in any Solana pool at any specific slot. Solana DEX trades happen at the exact slot they're included in. Multiple competing swaps can land in the same slot at slightly different prices depending on transaction ordering. A backtest that uses CEX OHLCV as its price source will undercount opportunities and miscalculate fill prices throughout.
The PumpSwap decimal trap. PumpSwap quotes one side of a swap in lamports (9 decimals) and the other in token base units (typically 6 decimals). If you compute a price ratio from raw amount_in divided by amount_out, the result is wrong by a factor of 10^(9 - token_decimals). For a standard 6-decimal token, that's a 1000x error. No exception is thrown. The number looks like a price. We've seen this go unnoticed for weeks in production code. The fix is to always use the _ui columns from NLN trading datasets, which handle this normalization before delivery.
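A minimal sketch of the normalization, for cases where you are working from raw amounts rather than the _ui columns. The function names and the swap values are hypothetical; only the 9-decimal lamport convention and the 6-decimal token example come from the text above.

```python
LAMPORTS_DECIMALS = 9  # 1 SOL = 10**9 lamports


def ui_amount(raw_amount: int, decimals: int) -> float:
    """Convert a raw on-chain integer amount to human units."""
    return raw_amount / 10 ** decimals


def price_sol_per_token(raw_sol: int, raw_token: int, token_decimals: int) -> float:
    """SOL per whole token, normalizing each side by its own decimals first."""
    return ui_amount(raw_sol, LAMPORTS_DECIMALS) / ui_amount(raw_token, token_decimals)


# Hypothetical swap: 0.5 SOL against 2,000 tokens of a 6-decimal mint.
raw_sol = 500_000_000        # lamports
raw_token = 2_000_000_000    # token base units

naive = raw_sol / raw_token                           # ratio of raw integers
correct = price_sol_per_token(raw_sol, raw_token, 6)  # ratio of UI amounts
error_factor = naive / correct                        # 10 ** (9 - 6)
```

Both numbers look like prices, which is why the raw ratio is easy to miss. Comparing them on one known swap is a cheap sanity check before trusting a whole dataset.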
Raydium's three-program problem. Raydium runs three separate programs: AMM v4 (675kPX9MHTjS2zt1qfr1NYHuzeLXfQM9H24wFSUt1Mp8), CLMM (CAMMCzo5YL8w4VFF8KVHrK22GGUsp5VTaW7grrKgrWqK), and CPMM (CPMMoo8L3F4NbTegBCKVNunggL7H1ZpdTHKxQB5qKP1C). A dataset built on AMM v4 only misses all concentrated liquidity pool volume and all CPMM pool volume. That's a large fraction of Raydium flow, and it's a silent gap: your arb detector runs, finds no arb on some pairs, and the reason is missing data rather than no opportunity. In NLN trading datasets the programVariant column distinguishes which program emitted each event, so all three are in the same Parquet files.
Slot skips. Solana targets 400ms per slot but validators skip slots. The index.parquet file included with the block archive has one row per slot with a tx_count column. Slots with tx_count = 0 were skipped. A simulation that assumes slot N+2 always has a block is wrong on those slots, and any fill delay logic built on that assumption will produce incorrect results during periods of high skip rate.
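One way to make fill-delay logic skip-aware is to resolve "slot N + delay" to the first slot that actually produced a block. The sketch below assumes the index rows have already been loaded into plain dicts with the slot and tx_count columns described above; the helper names and sample rows are illustrative.

```python
from bisect import bisect_left


def produced_slots(index_rows):
    """Sorted slot numbers that actually produced a block (tx_count > 0)."""
    return sorted(r["slot"] for r in index_rows if r["tx_count"] > 0)


def fill_slot(signal_slot, delay, produced):
    """First produced slot at or after signal_slot + delay, or None if there is none."""
    i = bisect_left(produced, signal_slot + delay)
    return produced[i] if i < len(produced) else None


# Hypothetical index rows: slots 101 and 102 were skipped.
rows = [
    {"slot": 100, "tx_count": 1200},
    {"slot": 101, "tx_count": 0},
    {"slot": 102, "tx_count": 0},
    {"slot": 103, "tx_count": 950},
    {"slot": 104, "tx_count": 1100},
]

produced = produced_slots(rows)
skip_rate = 1 - len(produced) / len(rows)
actual_fill = fill_slot(100, 2, produced)  # nominal target 102 was skipped
```

Here a nominal 2-slot delay from slot 100 lands in slot 103, because 102 never produced a block. The same lookup also yields the skip rate for the sampled range.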
02 What a Raw Block Actually Contains
Every block_<slot>.json in the archive is a raw getBlock response. The fields you actually use for trading strategy work are a small subset. Most of the response is validator bookkeeping you can ignore.
The top-level fields you'll use most:
- transactions[]: the full list of all transactions in the slot, vote and non-vote
- blockTime: Unix timestamp (seconds), null for slots before approximately early 2020
- parentSlot: the previous confirmed slot number
- rewards[]: validator rewards for this slot, including the block reward and staking yields
Inside each transaction:
- transaction.message.accountKeys[]: all accounts referenced, in order. Instruction data references these by index, not by address directly.
- meta.preTokenBalances[] / meta.postTokenBalances[]: token balance for each (account, mint) pair before and after the transaction. Fields: accountIndex, mint, owner, uiTokenAmount (with amount, decimals, uiAmount).
- meta.logMessages[]: program log output. For programs without a published IDL, this is often where decoded events live as Program log: ... strings.
- meta.err: null for a successful transaction, an error object for a failed one. Failed transactions consume block space and block compute, they just don't commit their state changes.
- meta.computeUnitsConsumed: useful for MEV analysis and for understanding how full a slot is.
The vote program address is Vote111111111111111111111111111111111111111. Any transaction where this address appears in transaction.message.accountKeys is a vote transaction. Filter it out before doing anything else.
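A minimal sketch of that filter, assuming the block was fetched with the plain json encoding so that accountKeys is a list of address strings. The hand-built block stands in for json.load() on a block_<slot>.json file, and every account name other than the vote program is a placeholder.

```python
VOTE_PROGRAM = "Vote111111111111111111111111111111111111111"


def is_vote(tx: dict) -> bool:
    """True if the vote program appears in the transaction's account keys."""
    return VOTE_PROGRAM in tx["transaction"]["message"]["accountKeys"]


def split_block(block: dict):
    """Return (non_vote, failed_non_vote) transaction lists for one block."""
    non_vote = [tx for tx in block["transactions"] if not is_vote(tx)]
    failed = [tx for tx in non_vote if tx["meta"]["err"] is not None]
    return non_vote, failed


# Stand-in for a parsed block file; real blocks carry many more fields.
block = {
    "transactions": [
        {"transaction": {"message": {"accountKeys": ["ValidatorA", VOTE_PROGRAM]}},
         "meta": {"err": None}},
        {"transaction": {"message": {"accountKeys": ["TraderA", "DexProgram"]}},
         "meta": {"err": None}},
        {"transaction": {"message": {"accountKeys": ["TraderB", "DexProgram"]}},
         "meta": {"err": {"InstructionError": [0, {"Custom": 1}]}}},
    ]
}

non_vote, failed = split_block(block)
print(f"total={len(block['transactions'])} non_vote={len(non_vote)} "
      f"failed={len(failed)} ({len(failed) / len(non_vote):.0%} of non-vote)")
```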
On most days, 15–25% of the non-vote transactions in a block have failed (meta.err is not null). Failed transactions are real: they tried, they consumed compute, they just didn't commit. For sniping and arb backtests, the competition you're modeling is partly those failed transactions.
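The preTokenBalances and postTokenBalances fields listed above are usually consumed as per-account deltas. A sketch under the assumption that uiTokenAmount.amount arrives as a decimal string, as it does in standard getBlock responses; the mint name and balances below are made up.

```python
def token_deltas(tx: dict) -> dict:
    """Map (accountIndex, mint) -> change in UI token amount across the transaction."""
    def to_map(balances):
        out = {}
        for b in balances or []:
            amt = b["uiTokenAmount"]
            out[(b["accountIndex"], b["mint"])] = int(amt["amount"]) / 10 ** amt["decimals"]
        return out

    pre = to_map(tx["meta"].get("preTokenBalances"))
    post = to_map(tx["meta"].get("postTokenBalances"))
    # An account missing on one side is treated as holding zero on that side.
    return {k: post.get(k, 0.0) - pre.get(k, 0.0) for k in pre.keys() | post.keys()}


# Hypothetical transaction: account 1 sends 3 tokens, account 2 is created holding 1.5.
tx = {
    "meta": {
        "preTokenBalances": [
            {"accountIndex": 1, "mint": "MintA", "owner": "OwnerA",
             "uiTokenAmount": {"amount": "5000000", "decimals": 6}},
        ],
        "postTokenBalances": [
            {"accountIndex": 1, "mint": "MintA", "owner": "OwnerA",
             "uiTokenAmount": {"amount": "2000000", "decimals": 6}},
            {"accountIndex": 2, "mint": "MintA", "owner": "OwnerB",
             "uiTokenAmount": {"amount": "1500000", "decimals": 6}},
        ],
    }
}

deltas = token_deltas(tx)
```

Working from the integer amount plus decimals, rather than uiAmount, keeps the normalization explicit and avoids depending on a float that the node already rounded.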
03 Two Paths to the Data
The block archive and the trading datasets solve different problems, and which one you start with determines how fast you get to a first result.
Historical Raw Blocks is the complete getBlock archive from genesis to the previous UTC midnight. Files arrive as a tar.zst archive via signed URL, extracted to flat block_<slot>.json files with no nested directories. The archive includes a manifest.json (start/end slot, file count, total size, SHA-256 per file) and an optional index.parquet (one row per slot: slot, block_time, parent_slot, leader, tx_count, file_name). The last 30 days is roughly 2TB; 6 months is around 12TB.
Trading Datasets are pre-parsed Parquet files across 40+ programs, one file per day, 4 to 12 GB compressed per program per month. They are DuckDB-ready out of the box with _ui normalized columns, pre-computed usd_value, and the programVariant tag on multi-program protocols. No extraction step: DuckDB reads the Parquet directly from the tar.zst via its built-in reader.
For teams starting out: if your strategy involves a program listed in the NLN trading datasets, start there. You'll be running DuckDB queries in an hour instead of writing a parser. Use raw blocks when you're working on a protocol not in the covered list, or when the failed-transaction context matters for your strategy.
The index.parquet is worth loading before anything else when you're working with a large date range. Filter by tx_count > 0 to skip empty slots before you build any time-series logic. Skipped slots with tx_count = 0 are real slot numbers in the sequence but produced no block.
04 Strategy 1: DEX Arbitrage Between Raydium and Orca
Raydium runs three programs. Most arb detectors are built on AMM v4 only, which means they miss all CLMM and CPMM volume. An arb that looks nonexistent on AMM v4 data may be live on CLMM. The backtest will never show it. We ran the simulation without the programVariant filter on the first pass. Results looked promising. Turned out a significant portion of the flagged opportunities were on CLMM pools, and AMM v4 can't fill those. Filter by programVariant before computing anything.
The strategy: when the same token pair trades at different implied prices on Raydium and Orca in the same slot, buy the cheaper side and sell the other. The minimum spread that matters is the combined fee floor: 0.25% Raydium AMM v4 plus 0.30% Orca Whirlpool equals 0.55%. Anything below that is noise.
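A sketch of the same-slot spread scan in plain Python. The row layout (slot, pair, venue, price) and the "amm_v4" variant label are assumptions for illustration, not the dataset's documented schema; only programVariant and the 0.55% fee floor come from the text.

```python
FEE_FLOOR = 0.0025 + 0.0030  # Raydium AMM v4 + Orca Whirlpool, per the text


def spread_opportunities(trades, raydium_variant="amm_v4", fee_floor=FEE_FLOOR):
    """Find slots where Raydium and Orca priced the same pair further apart than the fee floor."""
    last_price = {}  # (slot, pair) -> {venue: last price seen in that slot}
    for t in trades:
        # Keep only the Raydium program we can actually trade against.
        if t["venue"] == "raydium" and t.get("programVariant") != raydium_variant:
            continue
        last_price.setdefault((t["slot"], t["pair"]), {})[t["venue"]] = t["price"]

    out = []
    for (slot, pair), venues in sorted(last_price.items()):
        if "raydium" not in venues or "orca" not in venues:
            continue
        buy = min(venues, key=venues.get)   # cheaper venue
        sell = max(venues, key=venues.get)  # more expensive venue
        spread = (venues[sell] - venues[buy]) / venues[buy]
        if spread > fee_floor:
            out.append({"slot": slot, "pair": pair, "buy": buy, "sell": sell, "spread": spread})
    return out


# Hypothetical rows; prices are made up.
trades = [
    {"slot": 10, "pair": "SOL/USDC", "venue": "raydium", "programVariant": "amm_v4", "price": 100.0},
    {"slot": 10, "pair": "SOL/USDC", "venue": "orca", "price": 101.0},
    {"slot": 11, "pair": "SOL/USDC", "venue": "raydium", "programVariant": "amm_v4", "price": 100.0},
    {"slot": 11, "pair": "SOL/USDC", "venue": "orca", "price": 100.3},   # below fee floor
    {"slot": 12, "pair": "SOL/USDC", "venue": "raydium", "programVariant": "clmm", "price": 100.0},
    {"slot": 12, "pair": "SOL/USDC", "venue": "orca", "price": 102.0},   # wrong Raydium program
]

opps = spread_opportunities(trades)
```

Using the last price per venue per slot is one simplification among several. Intra-slot ordering matters in practice and is not modeled here.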
Scanning for spreads above that floor gives the raw opportunity distribution: how often the spread existed, how wide it was, and which pools it appeared on. This is not the same as profitability. The next step introduces fill delay.
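A sketch of the fill-delay step, assuming a per-slot price map for the two venues has already been built. The function name, the price map, and its values are hypothetical. Both legs are assumed to fill at the same delayed slot, which is itself an idealization.

```python
def edge_after_delay(prices, signal_slot, delay, buy_venue, sell_venue, fee_floor):
    """Net fractional edge if both legs fill at signal_slot + delay.

    Returns None when that slot has no price for either venue.
    """
    fill = prices.get(signal_slot + delay)
    if fill is None or buy_venue not in fill or sell_venue not in fill:
        return None
    gross = (fill[sell_venue] - fill[buy_venue]) / fill[buy_venue]
    return gross - fee_floor


# Hypothetical prices: the spread seen at slot 10 decays over the next two slots.
prices = {
    10: {"raydium": 100.0, "orca": 101.0},
    11: {"raydium": 100.2, "orca": 100.9},
    12: {"raydium": 100.5, "orca": 100.6},
}

FEE_FLOOR = 0.0055
edge_1 = edge_after_delay(prices, 10, 1, "raydium", "orca", FEE_FLOOR)
edge_2 = edge_after_delay(prices, 10, 2, "raydium", "orca", FEE_FLOOR)
edge_3 = edge_after_delay(prices, 10, 3, "raydium", "orca", FEE_FLOOR)  # no data
colocation_value = edge_1 - edge_2
```

In this made-up series the opportunity is still positive after one slot and negative after two. The difference between the two edges is the per-trade value of the faster fill.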
The 2-slot delay is the honest assumption for a bot not co-located with the validator. On our Frankfurt bare metal, we see co-located bots operating closer to 1-slot lag. Both numbers are worth running through the simulation: the delta between them is the value of co-location for this specific strategy and market period.
What this simulation tells you is where the edge exists and what size it needs to be worth trading. What it does not tell you is that you'll capture it: Jito bundles mean other bots are bidding for the same slots.
05 Strategy 2: Token Launch Sniping on PumpFun
The strategy: detect new token launches at the first create instruction, enter at the bonding curve opening price, exit at a configured slot window. Price reconstruction uses virtual_sol_reserves / virtual_token_reserves. Not the real reserves. The virtual values include an offset that stabilizes the curve at low liquidity, and using real reserves gives you a different (wrong) number.
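A sketch of the price reconstruction, assuming the SOL side is stored in lamports (9 decimals) and the token side in 6-decimal base units. The reserve values below are illustrative inputs, not figures taken from this post or from any specific launch.

```python
def curve_price_sol(virtual_sol_reserves: int, virtual_token_reserves: int,
                    token_decimals: int = 6) -> float:
    """SOL per whole token implied by the virtual reserves."""
    sol = virtual_sol_reserves / 10 ** 9
    tokens = virtual_token_reserves / 10 ** token_decimals
    return sol / tokens


# Illustrative state at launch: 30 SOL and 1.073 billion tokens of virtual reserves,
# with no real SOL deposited yet.
virtual_sol = 30 * 10 ** 9
virtual_tokens = 1_073_000_000 * 10 ** 6
real_sol = 0

opening_price = curve_price_sol(virtual_sol, virtual_tokens)
# The same formula on real reserves gives a different (here: zero) price.
price_from_real = curve_price_sol(real_sol, virtual_tokens)
```

With nothing deposited, the real SOL reserve yields a price of zero while the virtual reserves yield a usable opening price. That gap is the offset the paragraph above describes.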
The raw block path gives you something the trading datasets don't: the failed competing transactions. When a new token launches and 50 bots try to snipe it simultaneously, most of those attempts fail. The raw block shows all of them. That competition context matters for evaluating how realistic your simulated entry actually was.
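A sketch of the two measurements this implies: the P&L of an entry held for a fixed slot window, and the share of competing attempts that failed. It assumes a slot-to-price map for one token and a list of raw transactions touching the launch. The helper names and all values are hypothetical.

```python
def pnl_at_window(prices_by_slot: dict, entry_slot: int, window: int = 15) -> float:
    """Fractional P&L from entry_slot to the last observed price within the window."""
    entry = prices_by_slot[entry_slot]
    exit_slot = entry_slot + window
    # Carry the last trade price forward: the token may not trade in the exit slot itself.
    in_window = [s for s in prices_by_slot if entry_slot <= s <= exit_slot]
    return prices_by_slot[max(in_window)] / entry - 1.0


def failed_share(txs: list) -> float:
    """Share of transactions whose meta.err is set (attempted but not committed)."""
    if not txs:
        return 0.0
    return sum(1 for tx in txs if tx["meta"]["err"] is not None) / len(txs)


# Hypothetical launch: price in SOL per token at the slots where trades landed.
prices = {1000: 2.8e-8, 1005: 4.2e-8, 1014: 3.5e-8, 1020: 1.0e-8}
pnl_15slot = pnl_at_window(prices, entry_slot=1000, window=15)

# Hypothetical competition in the launch slot: one success, three failures.
attempts = [{"meta": {"err": None}}] + [{"meta": {"err": {"InstructionError": [0, {"Custom": 1}]}}}] * 3
competition = failed_share(attempts)
```

A high failed share in the entry slot is a hint that a simulated fill at the opening price was optimistic, even if the P&L column looks good.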
Run the simulation across three months of launches and look at the pnl_15slot distribution. Most launches go to zero. The ones that don't tend to cluster in specific market conditions: high-volume days, particular launch patterns. That clustering is the actual backtest finding. Not a win rate, but a set of conditions worth filtering on before deploying capital.
06 Strategy 3: Liquidation Windows on Kamino
Most liquidatable positions on Kamino clear within a few slots of becoming eligible. The ones that sit underwater for 10 or more slots are where slower bots still find fills. The backtest tells you which regime you're operating in: competitive (sub-5-slot clearance) or not. That answer changes the infrastructure requirements before you write a line of live code.
Price reconstruction is the part that trips people. The usd_value column in NLN trading datasets is pre-computed from the nearest oracle to each slot, with the source recorded in price_source for auditability. Join position state against that column directly. No need to reconstruct oracle prices yourself. The dataset already did it. The liquidation threshold varies by collateral type; Kamino's risk config publishes the per-asset values and they're worth pulling rather than hardcoding a placeholder.
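A sketch of the lag analysis, assuming each position has already been reduced to the slot where it first became eligible and the slot where it was liquidated. The eligibility check is a generic threshold rule, not Kamino's exact formula; the 0.8 threshold and the event rows are placeholders.

```python
from statistics import median


def is_liquidatable(collateral_usd: float, debt_usd: float, liq_threshold: float) -> bool:
    """Generic rule: debt exceeds collateral value times the liquidation threshold."""
    return debt_usd > collateral_usd * liq_threshold


def liquidation_lags(events: list) -> list:
    """Slots between becoming eligible and being liquidated, for positions that cleared."""
    return [e["liquidated_slot"] - e["eligible_slot"]
            for e in events if e["liquidated_slot"] is not None]


def regime(lags: list, competitive_below: int = 5) -> str:
    """'competitive' if the median clearance is under the cutoff, else 'slow'."""
    return "competitive" if median(lags) < competitive_below else "slow"


def reachable_share(lags: list, reaction_slots: int) -> float:
    """Share of liquidations still open after a bot's reaction time has elapsed."""
    return sum(1 for lag in lags if lag > reaction_slots) / len(lags) if lags else 0.0


# Hypothetical events for two days.
volatile_day = [{"eligible_slot": s, "liquidated_slot": s + lag}
                for s, lag in [(100, 1), (140, 2), (180, 1), (220, 2), (260, 3)]]
quiet_day = [{"eligible_slot": s, "liquidated_slot": s + lag}
             for s, lag in [(100, 12), (400, 25), (700, 8), (900, 30)]]

volatile_lags = liquidation_lags(volatile_day)
quiet_lags = liquidation_lags(quiet_day)
```

For a bot that needs 3 slots to react, reachable_share on each day's lags gives a rough estimate of how much of that day's liquidation flow was ever available to it.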
Run the lag analysis on a high-volatility day and compare it to a quiet day. The lag distribution shifts. During sharp price moves, positions go underwater and clear within 1–2 slots. During slow periods the same positions can sit for 20+ slots. If your bot can only react in 3 slots, you're relevant in the second regime, not the first. The backtest tells you which one was more common in your target period.
07 Strategy 4: LP Position P&L on Orca Whirlpool
The fee income numbers from this simulation are an upper bound. JIT liquidity bots add and remove positions within a single slot, capturing fee income on high-volume swaps before IL accumulates. Those positions don't appear cleanly in pool_events because the open and close happen in the same block. The simulation sees the swap volume but misses those bot positions. Your passive LP P&L estimate will be higher than what a passive position would have actually earned.
With that in mind: for a given tick range, compute fee income earned while in range minus IL from price divergence. The pool_events table preserves tick and bin resolution. Use amount_in_ui / amount_out_ui from dex_trades for IL calculation. The decimal normalization trap applies here the same way it applies to price ratios.
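A sketch of the calculation using the standard concentrated-liquidity amount formulas, with prices expressed as quote per base in UI units. The swap rows, the liquidity value, and the position's share of in-range liquidity are hypothetical inputs. The fee estimate inherits the upper-bound caveat above.

```python
from math import sqrt


def position_amounts(liquidity, price, p_low, p_high):
    """Base (x) and quote (y) held by a position with the given liquidity in [p_low, p_high]."""
    p = min(max(price, p_low), p_high)  # outside the range the position is single-sided
    x = liquidity * (1 / sqrt(p) - 1 / sqrt(p_high))
    y = liquidity * (sqrt(p) - sqrt(p_low))
    return x, y


def impermanent_loss(liquidity, p_entry, p_exit, p_low, p_high):
    """Value of holding the entry amounts minus value of the LP position, in quote units."""
    x0, y0 = position_amounts(liquidity, p_entry, p_low, p_high)
    x1, y1 = position_amounts(liquidity, p_exit, p_low, p_high)
    return (x0 * p_exit + y0) - (x1 * p_exit + y1)


def fees_in_range(swaps, p_low, p_high, fee_rate, liquidity_share):
    """Fees attributed to the position for swaps executed while price was in range."""
    return sum(s["volume_quote"] * fee_rate * liquidity_share
               for s in swaps if p_low <= s["price"] <= p_high)


# Hypothetical position and swaps.
L, P_LOW, P_HIGH = 1000.0, 90.0, 110.0
swaps = [{"price": 101.0, "volume_quote": 10_000.0},
         {"price": 120.0, "volume_quote": 50_000.0}]  # second swap is out of range

il = impermanent_loss(L, p_entry=100.0, p_exit=105.0, p_low=P_LOW, p_high=P_HIGH)
fees = fees_in_range(swaps, P_LOW, P_HIGH, fee_rate=0.003, liquidity_share=0.01)
net = fees - il
```

Treating liquidity_share as constant is a simplification. In a real pool the active liquidity changes as other positions enter and leave the range.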
The result shows fee income minus IL for that tick range over the holding period. Run it across several ranges to find where the fee-to-IL ratio was historically best.
08 From DuckDB to Production: A Rust Scaffold
If fill_delay_slots = 2 makes the strategy unprofitable in the Rust scaffold, the strategy is probably dead. The scaffold exists to kill strategies that only work on idealized timing assumptions, not to validate them. Run it at delay = 1, delay = 2, delay = 3. If the P&L collapses at delay = 2, you need co-location or a different strategy. Better to find that out here than six months into a live deployment.
DuckDB analysis implicitly assumes your fill always lands exactly N slots after signal. The Rust engine makes that assumption explicit and lets you stress-test it. That's the only reason to write it: not to replicate the DuckDB logic in Rust, but to force a commitment to the latency parameter and see what the edge profile looks like under different values.
This scaffold is not production code: it has no Jito bundle modeling, no position limits, no P&L accounting per-strategy. Those come later. The value here is that it forces you to commit to a fill_delay_slots number and see what the strategy earns under that constraint across a real block range.
For teams who want to run this against a live stream after the backtest passes, the same Geyser plugin infrastructure that powers the archive is what generates the live data. The BlockSource trait swaps from ArchiveSource to a Geyser stream without changing the strategy code.
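The scaffold described here is Rust. To stay in one language with the other examples, the sketch below shows the same shape in Python: a block-source interface, an archive-backed implementation, and a run loop with an explicit fill_delay_slots. The per-block edge field and the toy strategy are hypothetical stand-ins, not the scaffold's actual types.

```python
from typing import Iterator, Protocol


class BlockSource(Protocol):
    """Anything that yields blocks in slot order: an archive replay or a live stream."""
    def blocks(self) -> Iterator[dict]: ...


class ArchiveSource:
    """Replays in-memory blocks; a real version would read block_<slot>.json files."""
    def __init__(self, blocks):
        self._blocks = sorted(blocks, key=lambda b: b["slot"])

    def blocks(self) -> Iterator[dict]:
        return iter(self._blocks)


def run_backtest(source: BlockSource, fill_delay_slots: int) -> float:
    """Signal when a block shows positive edge; realize the edge of the block it fills in."""
    pending = []  # slots at which an earlier signal becomes fillable
    pnl = 0.0
    for block in source.blocks():
        due = [d for d in pending if d <= block["slot"]]
        pending = [d for d in pending if d > block["slot"]]
        pnl += block["edge"] * len(due)          # fills land at this block's edge
        if block["edge"] > 0:                    # toy signal
            pending.append(block["slot"] + fill_delay_slots)
    return pnl


# Hypothetical edge per block, decaying after the signal.
archive = ArchiveSource([
    {"slot": 1, "edge": 0.004}, {"slot": 2, "edge": 0.002}, {"slot": 3, "edge": -0.001},
    {"slot": 4, "edge": -0.003}, {"slot": 5, "edge": -0.003},
])

sweep = {delay: run_backtest(archive, delay) for delay in (1, 2, 3)}
```

Because run_backtest depends only on the blocks() method, a stream-backed source can replace ArchiveSource without touching the loop. That substitution is the point of the trait in the Rust version.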
09 What the Backtest Can't Tell You
Backtests find edges. They do not tell you whether you'll capture them.
The biggest gap is MEV competition. Your simulation assumes you are the only bot. In live trading, Jito bundle ordering means your fill price depends on what other searchers bid for priority in the same slot. A 2-slot fill delay in backtest is a parameter you set. In live trading it's the outcome of a priority fee auction you can't model from historical data.
JIT liquidity is the second gap, specific to LP simulations. Concentrated liquidity bots add and remove positions within a single slot and those positions don't appear cleanly in pool_events. The fee income estimate for a passive position is an upper bound because you're attributing fees to yourself that JIT bots captured in practice.
Slot skips affect any simulation that assumes slot N+2 always exists. The index.parquet shows tx_count = 0 for skipped slots. Check skip rate in your target date range before drawing conclusions about fill latency. It shifts with network conditions.
Geyser-to-submission latency is the one we see most often mismodeled. A 2-slot detection-to-fill delay in backtest often becomes 3 slots or more in production because Geyser stream subscription lag isn't accounted for. On co-located bare metal in Frankfurt, where the validator and your bot share a subnet, this gap is smallest. On a remote VPS it can exceed a full slot.
What the backtest does tell you: whether an edge exists at all, what size makes it worth trading, and which market conditions it survives. We've seen teams skip it and spend months debugging live performance only to discover the edge was never there in the historical data.
The historical data is already there. If you're building a strategy on Solana and you haven't tested it against real on-chain data at scale, you're skipping the most honest feedback loop available. The NLN Historical Raw Blocks archive covers genesis to yesterday, delivered as signed URLs within 24 hours. If you want to skip the raw parsing and start with DuckDB queries today, the NLN Trading Datasets cover Raydium, Orca, PumpFun, Kamino, and 40+ other programs with normalized columns and pre-computed prices.