PumpFun Data Analysis: Graduation Rates, Creator Wallets, and Bonding Curve Price Reconstruction (2026)
Analyze PumpFun launch data: graduation rates, creator wallet clustering, bonding curve price reconstruction, is_mayhem_mode, DuckDB queries.
On this page +
PumpFun's bonding curve is deterministic. Every trade commits the post-trade reserve state on-chain. That means the full price history for any token, from first buy to graduation, is reconstructible from pumpfun_events without oracles, without CEX feeds, without any external source. The record is complete.
The harder problem is that thousands of tokens launch per day and most follow the same arc: price spikes in the first few minutes, fades within an hour, flat line after. Individual token analysis at that scale is mostly noise. The useful work is aggregate: graduation rates by market condition, creator wallet patterns, early-buyer entry price distributions across cohorts. That's what the data makes tractable, and what this article covers.
01What the Data Covers#
Two event types make up the pumpfun_events dataset: creates and trades.
Create events are token launches. One row per token, capturing everything from the bonding curve initialization: the mint address, the creator wallet, the initial virtual and real reserves, the is_mayhem_mode flag, the on-chain metadata.
Trade events are buys and sells through the bonding curve. One row per transaction, with the full reserve state after each trade, the SOL and token amounts, fee breakdown, and the instruction name. When the bonding curve accumulates approximately 85 SOL in real reserves, graduation triggers. The final trade row for that mint carries ix_name = 'migrate' or 'complete_bonding_curve'. At that point the bonding curve closes, and the accumulated SOL plus the remaining token supply seed a new PumpSwap AMM pair. That is the graduation event.
After graduation, the token trades on PumpSwap as a standard AMM pair. That history continues in the PumpSwap dex_trades table, not in pumpfun_events. If you're building a full price history across a token's life, you need both tables and a join on mint.
Coverage starts from PumpFun's launch on Solana. Daily updates land by 06:00 UTC the following morning.
02How the Bonding Curve Prices Trades#
Every PumpFun token opens with the same initial virtual reserves: approximately 30 SOL on the SOL side and 1.073 billion tokens on the token side. The opening price works out to roughly 0.000000028 SOL per token. That number is small by design. Early buyers get in cheaply; the curve prices them out as real SOL flows in.
The mechanism is a constant-product curve. As buyers add SOL, virtual_sol_reserves rises and virtual_token_reserves falls. Because the product of the two stays constant, each successive buy is more expensive than the one before it. The price impact of the tenth SOL into a token is larger than the first, and the impact of the eightieth SOL is larger still. By graduation the price is roughly 14 to 15 times the opening price, given standard initial reserve values. That multiple comes directly from the math: price_multiple = ((vsr_0 + sol_added) / vsr_0)^2. At 85 SOL added to a 30-SOL virtual base, that is (115 / 30)^2 = about 14.7x.
The curve's constant k = virtual_sol_reserves * virtual_token_reserves is fixed at launch. From k and the initial reserves you can compute the expected price at any SOL fill level without scanning the trade history:
The price_multiple_vs_open column shows how many times the opening price a buyer at that fill level is paying. Run this for a few tokens in your dataset to verify the curve shape before building any strategy that depends on entry price assumptions. The standard initial reserves are consistent across most launches, but is_mayhem_mode tokens use modified parameters and will show a different curve shape entirely.
03Create Event Schema#
The fields we reach for first in any new PumpFun analysis are mint, creator, virtual_sol_reserves, virtual_token_reserves, is_mayhem_mode, and slot. That covers token identity, launch price, segmentation, and timing. The full schema has more, but most queries start with that subset and pull in other fields as needed.
Every row in pumpfun_events_creates represents one token launch. The fields worth understanding before writing any query:
Mint and identity
mint: the SPL mint address. Primary key across all PumpFun tables.bonding_curve: the PDA holding bonding curve state for this token. Each mint has exactly one.creator: the wallet that signed the create transaction. Many launches today go through third-party launch services that sign on behalf of the actual project team. A single creator address launching hundreds of tokens per day is almost certainly a proxy, not a person.name,symbol,uri: on-chain metadata.uripoints to a JSON file on IPFS or Arweave with the full metadata including image. Not guaranteed to resolve. Many tokens orphan their metadata within days, so write your pipeline to handle a 404 gracefully rather than treating it as an error.
Reserve state at launch
virtual_sol_reserves: the SOL side of the bonding curve at initialization, including the virtual offset built into the curve. This is the value you use for price calculation.virtual_token_reserves: the token side at initialization, including virtual offset.real_token_reserves: the actual token balance held by the bonding curve account. No virtual component. This is what the account physically holds.token_total_supply: fixed token supply minted at launch.
The virtual/real distinction is the most common source of wrong prices in PumpFun analysis. Price at any point on the bonding curve is virtual_sol_reserves / virtual_token_reserves. Using real_token_reserves in that formula gives a different number. It's not a rounding error, it's a structurally different value, and there's no exception thrown when you use it.
Flags and timing
is_mayhem_mode: boolean, set at launch. Mayhem mode tokens have modified bonding curve parameters. The curve shape, graduation behavior, and effective price range differ from standard launches. Any analysis mixing mayhem mode with standard launches produces a blended distribution that describes neither type accurately.slot,block_time,timestamp: three time references.block_timeis the on-chain Unix timestamp from the block header.timestampis the program-recorded timestamp. For joining with other datasets,slotandblock_timeare the reliable columns.tx_index: position of this transaction within its block. The only reliable way to order multiple events within the same slot.
Additional fields
signature: transaction signature. Useful for cross-referencing with block explorers.user: wallet that triggered the create. Often the same ascreator, but can differ through certain launch interfaces.
04Trade Event Schema#
For price reconstruction, the three fields that matter most are virtual_sol_reserves, virtual_token_reserves, and tx_index. The reserve columns give you the post-trade state after every buy or sell. tx_index is what lets you order events within the same slot correctly. Everything else in this table is supplementary until you get to fee analysis or user-level attribution.
is_buy: boolean.truemeans SOL in, tokens out.falsemeans tokens in, SOL out.sol_amount: raw lamports exchanged. Divide by1e9for SOL units. The_uivariant (sol_amount_ui) handles this for you if available in the dataset version you're using.token_amount: raw token base units exchanged. Divide by10^decimals(typically 6) for human-readable units. Usetoken_amount_uito skip the math.virtual_sol_reserves,virtual_token_reserves: reserve state after this trade. Use these for post-trade price. For pre-trade price, fetch the prior row's reserves usingLAG()ordered by slot andtx_index.fee_basis_points: protocol fee rate. 100 = 1%.fee: protocol fee for this trade in lamports.creator_fee_basis_points: creator-set fee rate. Zero if the creator didn't set one.creator_fee: creator fee for this trade in lamports. Zero whencreator_fee_basis_pointsis zero.ix_name: instruction name.'buy','sell', or'migrate'/'complete_bonding_curve'at graduation.user: wallet that executed the trade.
05What the Schema Doesn't Tell You#
The thing to get right before your first real query is which reserve values do what. real_token_reserves is the bonding curve account's actual token balance. virtual_sol_reserves and virtual_token_reserves include the virtual offset that keeps the curve well-behaved at low liquidity. Price at any point is virtual_sol_reserves / virtual_token_reserves. Use real reserves in that formula and you get a different number with no error thrown. We've seen this trip up analysts more than any other field in the dataset.
The creator field has a similar surface-area issue. It's the signing wallet, not necessarily the project. When we ran creator-level analysis on a month of creates, the top 20 creator addresses accounted for a disproportionate share of total launches. All of them were services. Run a launch-frequency distribution first before building any creator-level conclusions. Any wallet with thousands of launches is a proxy, not a person.
If you're building a full price history across a token's life, the graduation point is where the decimal handling changes. On the bonding curve, price is virtual_sol_reserves / virtual_token_reserves, normalized by the reserve structure. After graduation, trading moves to PumpSwap dex_trades where sol_amount is in lamports and token_amount in token base units. A raw sol_amount / token_amount ratio there is wrong by 10^(9 - decimals), which is 1000x off for a 6-decimal token. Use _ui columns for post-graduation trades, virtual reserve ratio for everything before.
Two smaller edge cases. block_time is identical for all transactions in a slot, so MIN(block_time) won't resolve intra-slot ordering. Use tx_index instead, which is the transaction's position within the block. And Token-2022 tokens with transfer hooks record the instruction input in token_amount, not the post-hook amount. For most aggregate analysis the difference is negligible. For precise fee accounting on affected tokens it matters.
06Getting the Data#
The pumpfun_events tables are part of the NLN Trading Datasets: pre-parsed Parquet files, partitioned by day, delivered as tar.zst archives via signed URLs. DuckDB reads them directly without an extraction step.
The create and trade tables ship separately: pumpfun_events_creates_YYYY-MM-DD.parquet and pumpfun_events_trades_YYYY-MM-DD.parquet. Join them on mint when you need the full picture from launch through trading history.
All queries in this article run against extracted Parquet files using the glob pattern *.parquet. For pandas users, individual files load with pd.read_parquet(). The glob patterns only apply in DuckDB.
07DuckDB Analysis Patterns#
Five queries for the most common PumpFun analysis needs.
Launch volume by day
Run this first on any new dataset range. The is_mayhem_mode split tells you what fraction of launches to exclude from standard analysis before writing any further queries.
Price path reconstruction for a single token
The LAG() window function gives you the pre-trade price, which is what a buyer saw before submitting. The delta between price_pre_trade and price_post_trade is the price impact of that trade on the curve.
Graduation rate analysis
Run this across a month and compare graduation_rate_pct between mayhem mode and standard launches. The avg_minutes_to_grad column tells you whether the tokens that did graduate did so quickly (sniped to graduation within minutes) or over a longer period (organic accumulation over hours). Those two paths look very different in the early buyer data.
Creator wallet distribution
The creator_type thresholds here are illustrative. Adjust them after looking at the actual distribution in your dataset. The key insight is almost always the same: a small number of creator addresses account for a large fraction of total launches, and those addresses are services, not people.
Early buyer entry analysis
unrealized_peak_return here is the theoretical return if you sold at peak price. In practice early buyers rarely sell at peak, but the distribution of unrealized_peak_return across buy ranks 1 through 5 across a large token cohort tells you whether first-mover advantage was real in your target period. Filter to a specific date range before running this at scale: the peak_prices CTE scans the entire trades Parquet for every mint in the dataset.
08Why Individual Token Analysis Fails at Scale#
PumpFun data at scale inverts the usual framing. Individual token analysis is rarely the productive question. The data is too noisy at the individual level: thousands of tokens launch per day, most follow the same pattern, and the few that don't aren't predictable from the create event alone.
The productive framing is cohort analysis: group launches by some characteristic (mayhem mode vs not, launch time of day, creator type, market activity level) and compare outcomes across cohorts. The questions that have real signal are aggregate ones.
When we've done cohort analysis on graduation rates, the clearest signal appears to be launch conditions and timing rather than anything in the token's metadata. Token name and symbol are noise at the aggregate level. Market-wide activity at the time of launch seems to matter more than anything in the create event itself. The following query tests this directly: it groups launches into quintiles by how busy the launch day was, then compares graduation rates across those quintiles.
Quintile 1 is the quietest days. Quintile 5 is the busiest. The direction of the relationship shifts with market conditions. Run this across your target period before drawing any conclusions about when to deploy capital. The signal shifts, and what held in one quarter may not hold in the next.
When you do find a cohort with better-than-average graduation rates or sustained price action, the next step is usually digging into the raw block data to see the failed competing transactions: who else was trying to get into the same launches, and what did the competition look like. That's where the NLN Historical Raw Blocks archive comes in. The parsed pumpfun_events table tells you what succeeded. The raw blocks show you everything that tried.
If you're building on PumpFun data at scale, the NLN Trading Datasets include pumpfun_events creates and trades as DuckDB-ready Parquet, partitioned by day, covering from PumpFun's launch to yesterday. For research that requires the full context including failed transactions and competing snipe attempts, the raw block archive at NLN Historical Raw Blocks has the complete ledger.
Every benchmark in this blog runs against our public endpoints.
Spin up an RPC, WebSocket, or gRPC endpoint in under a minute. Flat pricing, no request caps. Reproduce the numbers for your own workload.