groundy

make-look-scanned Simulates Scans in an Offline WASM File, Exposing PDF Provenance as a Pixel Check

make-look-scanned runs scan simulation in an offline 8 MB WASM file: skew, grain, JPEG artifacts, zero uploads. Scan appearance is no longer a proxy for physical provenance.


The scan simulation pipeline in make-look-scanned fits in a single ~8 MB HTML file that runs offline, touches no server, and produces an image-only PDF with configurable skew, grain, paper tone, and JPEG artifacts. To a reader judging by eye, the output is hard to tell apart from a real scanner’s; telling them apart takes pixel-level analysis. That makes scan appearance a cheap artifact that anyone can reproduce client-side rather than a proxy for physical provenance, which is a problem for any verification workflow that treats the two as equivalent.

What does the effect pipeline apply to each page?

The tool rasterizes each PDF page to an image, runs a configurable effect pipeline, then reassembles the results into a new image-only PDF, stripping all selectable text in the process. That output structure is exactly what a basic flatbed scanner produces: pixels, not a PDF with embedded glyphs.

The default parameters target specific scanner artifact classes. Skew 0.6° reproduces misaligned paper placement; noise 0.08 adds sensor grain; JPEG quality 70 is aggressive enough to produce block artifacts without looking obviously degraded; edge shadow 0.15 mimics the light falloff around a flatbed scanner’s glass edge; paper tone 0.6 adds the warm off-white cast of document stock; blur sigma 0.4 softens rasterized edges the way an optical sensor would. Every parameter is user-configurable via CLI flags, so a caller can dial in more aggressive grain, a warmer tone, or a different compression ratio to match a specific scanner profile or test a downstream pipeline’s tolerance.
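
The defaults listed above can be gathered into one configuration object. The following is a minimal Python sketch, not the project's Go code: the `ScanParams` name, the field names, and the validation ranges are assumptions made for illustration, and only the default values come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScanParams:
    """Hypothetical container for the default values described above."""
    skew_degrees: float = 0.6   # misaligned paper placement
    noise: float = 0.08         # sensor grain strength
    jpeg_quality: int = 70      # block artifacts without obvious degradation
    edge_shadow: float = 0.15   # light falloff near the glass edge
    paper_tone: float = 0.6     # warm off-white cast
    blur_sigma: float = 0.4     # optical softening

    def validate(self) -> None:
        # These ranges are assumed for the sketch, not taken from the tool.
        if not 1 <= self.jpeg_quality <= 100:
            raise ValueError("jpeg_quality must be in 1..100")
        for name in ("noise", "edge_shadow", "paper_tone", "blur_sigma"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")


defaults = ScanParams()
# A harsher profile for stress-testing a downstream pipeline's tolerance.
harsh = replace(defaults, noise=0.2, jpeg_quality=40)
harsh.validate()
```

Keeping the profile immutable and deriving variants with `replace` makes it easy to record exactly which settings produced a given test fixture.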

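The effect sequence matters, not just the values, because each step operates on the output of the one before it: grain applied before JPEG compression gets folded into the block artifacts, while grain applied afterwards sits on top of them. The parameter space is also wide enough that outputs from different configurations can look quite different from each other, which complicates any statistical fingerprinting approach that tries to attribute documents to a specific “scan” origin.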
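
The point about ordering can be checked with a toy example. This Python sketch uses a 16-pixel gray ramp and a crude quantizer as a stand-in for lossy compression; it is not JPEG and not the tool's pipeline, and the function names are hypothetical.

```python
import random


def add_grain(pixels, strength, seed):
    """Add seeded Gaussian noise, clamped to the 0..1 range."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(0.0, strength))) for p in pixels]


def quantize(pixels, levels):
    """Crude stand-in for lossy compression: snap to a few gray levels."""
    return [round(p * (levels - 1)) / (levels - 1) for p in pixels]


row = [i / 15 for i in range(16)]  # a 16-pixel gray ramp

grain_then_compress = quantize(add_grain(row, 0.08, seed=1), levels=8)
compress_then_grain = add_grain(quantize(row, levels=8), 0.08, seed=1)

# Same two steps, same parameters, same seed; only the order changes.
order_matters = grain_then_compress != compress_then_grain
```

In the first ordering every pixel ends up on the quantizer's grid, so the grain survives only as level jumps; in the second the grain is left intact on top of the quantized ramp.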

What are the two deployment modes?

The project ships a Go CLI that statically links MuPDF via cgo, and a browser WASM build packaged as a self-contained HTML file, each using a different rendering engine for rasterization.

The CLI uses go-fitz, which wraps MuPDF, for page rasterization. MuPDF is a mature PDF renderer that produces consistent output on complex documents with layered vector graphics, embedded CJK fonts, and transparency. The browser build cannot use go-fitz, because go-fitz binds MuPDF through cgo and Go’s WebAssembly target does not support cgo, so it substitutes PDF.js for rasterization and compiles only the Go effects pipeline to WASM. The two renderers differ at the pixel level, so CLI and browser outputs are visually equivalent but not byte-identical. For most use cases the distinction is irrelevant; it matters if you are doing pixel-level consistency checks across build paths or trying to reproduce a byte-identical output in a different environment.

How does the WASM browser build work?

The browser build inlines the WASM binary, Go’s runtime glue, and PDF.js including its web worker as base64 into a single HTML file; it runs offline in any modern browser without a CDN dependency or install step. The ~8 MB file size is the cost of bundling a PDF renderer, a Go runtime, and an effects pipeline into one artifact.
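
The inlining step can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the project's build script: the `inline_binary` helper, the `{{WASM_B64}}` placeholder, and the template are hypothetical, and an 8-byte WASM header stands in for the real binary.

```python
import base64


def inline_binary(html_template: str, placeholder: str, payload: bytes) -> str:
    """Replace a placeholder in an HTML template with base64-encoded bytes."""
    encoded = base64.b64encode(payload).decode("ascii")
    return html_template.replace(placeholder, encoded)


template = (
    "<!doctype html><script>"
    'const wasmBytes = Uint8Array.from(atob("{{WASM_B64}}"), c => c.charCodeAt(0));'
    "</script>"
)
# Magic number plus version field: the first 8 bytes of any WASM module.
fake_wasm = b"\x00asm\x01\x00\x00\x00"
page = inline_binary(template, "{{WASM_B64}}", fake_wasm)

# Base64 emits 4 output characters for every 3 input bytes.
overhead = len(base64.b64encode(b"x" * 3000)) / 3000
```

Base64 inflates every embedded asset by about a third, which accounts for part of the single-file size.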

The processing chain: PDF.js rasterizes each page to a pixel buffer in a web worker, passes that buffer to the Go effects compiled to WASM, which applies the same skew, grain, tone, shadow, blur, and JPEG compression steps as the CLI, and the output is assembled into an image-only PDF inside the browser tab. Nothing is transmitted to any server unless the user explicitly exports or uploads the file downstream.

This is the architectural pattern the ExactPDF author documented for a 70-tool client-side PDF suite: zero outbound file payloads, verifiable by watching the DevTools Network tab during a processing run. That suite runs for under $40/month while serving roughly 1,300 weekly users (approximately 80% of traffic organic), because compute stays in the user’s browser.

What does determinism mean for this kind of tool?

The output is deterministic by default: the random seed is derived from the input PDF’s content hash, so the same file always produces the same scan. Passing --seed N yields a different but still reproducible look; the same file plus the same seed produces a byte-identical output PDF.
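
The seeding behaviour can be sketched in a few lines of Python. The source says only that the default seed comes from the input's content hash; using SHA-256, taking its first 8 bytes, and the `derive_seed` and `grain_field` names are all assumptions made for this illustration.

```python
import hashlib
import random
from typing import List, Optional


def derive_seed(pdf_bytes: bytes, seed: Optional[int] = None) -> int:
    """Use an explicit seed if given, else derive one from the file contents."""
    if seed is not None:
        return seed
    digest = hashlib.sha256(pdf_bytes).digest()  # hash choice is assumed
    return int.from_bytes(digest[:8], "big")


def grain_field(pdf_bytes: bytes, count: int, seed: Optional[int] = None) -> List[float]:
    """Generate a reproducible noise sequence for a given input file."""
    rng = random.Random(derive_seed(pdf_bytes, seed))
    return [rng.gauss(0.0, 0.08) for _ in range(count)]


doc = b"%PDF-1.7 example bytes"
first_run = grain_field(doc, 4)
second_run = grain_field(doc, 4)            # same file, same noise
custom_look = grain_field(doc, 4, seed=42)  # same file, different look
```

The explicit seed plays the role of `--seed N`: it changes the noise while keeping every run repeatable.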

For testing pipelines that ingest scan-like documents, this is directly useful. A QA suite that needs consistent input fixtures can generate them once and reproduce them exactly from the same source PDF, rather than managing a library of pre-generated scan images. The seed parameter provides a controlled way to generate multiple distinct “scan” variants of the same document for broader fixture coverage.

The flip side involves verification. Statistical noise analysis is sometimes used to infer whether a scan originated from a specific device, based on the sensor noise signature. Because make-look-scanned’s noise is computed from a deterministic function of the document’s content hash, the noise pattern carries different information from physical scanner noise even if the distributions look similar on visual inspection. A verifier relying on noise-based fingerprinting would need to know which seed derivation was used to distinguish synthetic from physical scanner noise, and for the default content-hash seed that derivation is public: anyone holding the source PDF can regenerate the noise and compare, whereas physical scanner noise cannot be regenerated from the document at all.

Why does zero-upload matter beyond privacy claims?

Zero-upload WASM is an architectural guarantee, not a policy claim. If the processing chain never sends a file payload to a server, there is nothing for a server to log, breach, or subpoena. A developer auditing an internal tool can confirm the guarantee by watching the Network tab during a processing run, without trusting a privacy policy.

The cost argument is equally concrete. ExactPDF’s infrastructure runs under $40/month for over a thousand weekly users, because per-file compute never hits a server. A traditional server-side PDF pipeline at that volume would require at minimum a small container instance plus storage and egress costs. For scan simulation specifically, the per-file compute (rasterization plus multi-step pixel manipulation plus JPEG re-encoding) is non-trivial; routing it to the client removes it from the server’s cost envelope entirely.

What is the provenance gap this exposes?

Scan appearance and physical provenance are not the same property, and tooling like this sharpens that gap. A document that was physically printed, placed on a scanner, and scanned has a provenance chain: the scanner model’s specific sensor noise characteristics, the JPEG encoder parameters of that device, the timestamp embedded in the output, and typically metadata in the PDF container from the scanning application. A document processed by make-look-scanned has synthesized analogs of those artifacts, generated from parameterized algorithms rather than physical capture.

A paper presented at the SAICSIT 2024 Conference (arXiv:2507.00827) found that most PDF tampering-detection techniques rely on hashes or watermarks and cannot detect alterations to non-visual aspects such as PDF signatures or metadata. The paper’s prototype detects changes to text, images, and metadata using PDF file page objects, which is more thorough than hash comparison, but still not designed to distinguish a synthetic scan from a physical one. Image-only PDFs remove selectable text and the structural metadata that most detection schemes examine; what remains is pixel data and PDF container structure, neither of which encodes physical provenance.

Scanic, a Rust/WASM library solving the inverse problem (correcting physical scan distortion for document edge detection), achieves approximately 10 ms perspective transforms versus 500 ms or more in pure JavaScript loops, at under 100 KB gzipped versus OpenCV.js at 30 MB or more. Client-side image processing now runs non-trivial pipelines in milliseconds without a server. The cost of producing convincing document image effects, in either direction (synthesis or correction), has fallen to the point where it is accessible to any web developer.

Make-look-scanned is not designed as a fraud tool; its stated purposes (styling documents for print workflows, testing pipelines that process scan-like input, producing a consistent visual appearance) are legitimate. The narrower observation is that any verification workflow treating scan appearance as a proxy for physical provenance is relying on an assumption this class of tooling directly contradicts.

What do the licenses actually constrain?

The CLI binary statically links MuPDF via cgo and is licensed AGPL-3.0. AGPL includes a network use clause: if you run the binary as part of a service that users interact with over a network, you must offer the source code of your modified version. Distributing the compiled binary without offering corresponding source is a violation. For internal tooling that never crosses a network service boundary, AGPL is easier to accommodate; for anything offered as a hosted service, the source-availability requirement is real and enforceable.

The browser build uses PDF.js and does not include MuPDF, so it carries no AGPL obligation from MuPDF. You can host the HTML file or redistribute it without triggering MuPDF’s copyleft requirements.

The practical split is clear: teams that build internal tools or can accept AGPL terms get the CLI with MuPDF’s rendering quality. Teams distributing a product, wrapping this in a SaaS offering, or embedding it in a commercial application should use the browser build to avoid MuPDF’s copyleft terms.

Frequently Asked Questions

Can two separately processed copies of the same document be compared to detect synthetic scanning?

Yes. Physical scanners produce different sensor noise on every pass, so two scans of the same document never match at the pixel level. Make-look-scanned’s default seed is derived from the source PDF’s content hash, so two independently produced outputs from the same file have identical noise patterns, which a verifier comparing both copies could flag as evidence of synthetic processing.
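
A minimal sketch of that comparison in Python, assuming both copies have already been decoded to equally sized grayscale pixel buffers. The function names, the 0.999 threshold, and the toy noise model are hypothetical; a copy that was re-encoded or made with a custom seed would not match this way.

```python
import random


def matching_fraction(a: bytes, b: bytes) -> float:
    """Fraction of pixel positions where two equally sized buffers agree."""
    if len(a) != len(b) or not a:
        raise ValueError("buffers must be non-empty and the same size")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def looks_synthetic(a: bytes, b: bytes, threshold: float = 0.999) -> bool:
    """Flag two supposedly independent scans whose pixels match almost exactly."""
    return matching_fraction(a, b) >= threshold


def noisy_copy(page: bytes, seed: int) -> bytes:
    """Toy scan model: add seeded Gaussian noise to each pixel, clamped to 0..255."""
    rng = random.Random(seed)
    return bytes(min(255, max(0, p + round(rng.gauss(0, 6)))) for p in page)


page = bytes([200] * 1000)
synthetic_pair = (noisy_copy(page, 7), noisy_copy(page, 7))  # same seed twice
physical_pair = (noisy_copy(page, 1), noisy_copy(page, 2))   # fresh noise each pass
```

With independent noise only a small share of pixels coincide by chance, so near-total agreement is the signal worth flagging.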

Does the AGPL network use clause apply to a backend pipeline that processes documents without a user-facing interface?

AGPL’s network use clause (section 13) applies when users interact remotely over a network with a modified version of the covered program, so an automated pipeline that accepts user-submitted documents and returns results counts as network interaction even without a user-facing interface. A strictly internal workflow where employees run the CLI locally for their own documents is less likely to trigger the clause, though that line is fact-specific. The safest path for any user-facing document service is the browser build, which carries no MuPDF AGPL obligation.

How does Scanic’s browser WASM design differ architecturally from make-look-scanned’s effects pipeline?

Scanic splits work between CPU and GPU: Rust/WASM handles SIMD-accelerated Gaussian blur, Canny edge detection, and morphological dilation for locating document boundaries, while a custom Triangle Subdivision algorithm offloads perspective warping to the Canvas GPU path. Make-look-scanned runs its entire effects pipeline through WASM with no GPU offload. The split matters on mobile devices, where routing perspective correction to the GPU prevents main-thread jank during camera-captured document scanning.

What PDF metadata survives in an image-only output that could still identify the generating tool?

An image-only PDF still carries a document information dictionary with creator application, producer, and creation timestamp fields. Make-look-scanned populates these with its PDF library’s defaults unless the caller overrides them. A verifier examining the PDF container metadata rather than pixel data could identify the generating software even when the visual content passes casual inspection, and the page-object approach described in the SAICSIT 2024 paper would surface this alongside image and text change detection.
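
A rough Python sketch of such a container-level check follows. It assumes the document information dictionary is stored uncompressed with literal-string values; real files may keep it in compressed object streams, hex strings, or XMP metadata, so a proper PDF parser is the better tool. The sample values are invented and are not what make-look-scanned writes.

```python
import re

# Matches /Creator, /Producer, or /CreationDate followed by a literal string.
# Nested parentheses inside the value are not handled.
INFO_FIELD = re.compile(
    rb"/(Creator|Producer|CreationDate)\s*\(((?:\\.|[^\\)])*)\)"
)


def info_fields(pdf_bytes: bytes) -> dict:
    """Extract literal-string Info entries from uncompressed PDF bytes."""
    return {
        m.group(1).decode("ascii"): m.group(2).decode("latin-1")
        for m in INFO_FIELD.finditer(pdf_bytes)
    }


sample = (
    b"1 0 obj\n<< /Producer (ExampleLib 1.0) /Creator (example-tool) "
    b"/CreationDate (D:20240101000000Z) >>\nendobj\n"
)
fields = info_fields(sample)
```

A verifier could compare the extracted producer string against the values real scanner software writes, independent of how convincing the pixels look.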

sources · 4 cited

  1. marquaye/scanic: Modern Document Scanning library. GitHub (github.com). Accessed 2026-06-24.