What is metadata, and why it often leaks more than the file itself
You redacted the sensitive text in a Word document and sent the PDF. Turns out the Word document's tracked changes went with it. A practical tour of the metadata that rides along invisibly on the files you share.
“Metadata” is data about data: who made the file, when, on which device, with which software, by whose account, and — in ways people rarely expect — containing fragments of content that the visible part of the file does not show.
Metadata has been embarrassing people for three decades. It is how prosecutors discover that a “final” PDF was edited after the dated signature. It is how journalists discover which government office produced a leaked memo. It is how abuse photos are traced to the phone that took them. And it is how private business deals show up on the front page of the newspaper because the PDF kept the Word comments.
Office documents
Microsoft Word, Excel, and PowerPoint files carry a surprising amount of extra information:
- Author name and company — often a real employee’s name, pulled from the Office install.
- Track changes and comments. Edits, questions, and snarky remarks made during review stay in the file when the markup is merely hidden, and comments remain even after you click “Accept all” until you delete them.
- Revision history — previous versions of the document, accessible via metadata even though they are not visible in the main view.
- Embedded files — a PowerPoint that has an Excel chart embedded may carry the entire underlying spreadsheet.
- File paths and printer names from the computer where the file was saved.
- Document template used, including sometimes the full path to the network template share.
Microsoft’s Document Inspector, built into every recent version of Office (File → Info → Check for Issues → Inspect Document), finds and removes most of this. It is free. Take the thirty seconds. For LibreOffice, the equivalent is under Tools → Automatic Redaction and File → Properties.
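To see what such an inspector is looking for, here is a minimal Python sketch that opens a .docx as the ZIP package it is and reports the author fields, tracked insertions and deletions, and comments. It assumes the standard part names (docProps/core.xml, word/document.xml, word/comments.xml) and only looks at the main document body, not headers, footers, or embedded files. The names inspect_docx and make_sample_docx are illustrative, and this is not a replacement for Document Inspector.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the parts we read.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC = "http://purl.org/dc/elements/1.1/"
CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"


def inspect_docx(source):
    """Report author fields, tracked changes, and comments in a .docx.

    `source` is a path or a binary file-like object.
    """
    report = {"creator": None, "last_modified_by": None,
              "insertions": 0, "deletions": 0, "comments": 0}
    with zipfile.ZipFile(source) as z:
        names = set(z.namelist())
        if "docProps/core.xml" in names:
            core = ET.fromstring(z.read("docProps/core.xml"))
            report["creator"] = core.findtext(f"{{{DC}}}creator")
            report["last_modified_by"] = core.findtext(f"{{{CP}}}lastModifiedBy")
        if "word/document.xml" in names:
            body = ET.fromstring(z.read("word/document.xml"))
            # Tracked insertions and deletions are <w:ins> and <w:del> elements.
            report["insertions"] = sum(1 for _ in body.iter(f"{{{W}}}ins"))
            report["deletions"] = sum(1 for _ in body.iter(f"{{{W}}}del"))
        if "word/comments.xml" in names:
            comments = ET.fromstring(z.read("word/comments.xml"))
            report["comments"] = sum(1 for _ in comments.iter(f"{{{W}}}comment"))
    return report


def make_sample_docx():
    """Build a tiny in-memory package with only the parts read above.

    This is a hypothetical test fixture, not a file Word would open.
    """
    core = (f'<cp:coreProperties xmlns:cp="{CP}" xmlns:dc="{DC}">'
            '<dc:creator>Jane Example</dc:creator>'
            '<cp:lastModifiedBy>Jane Example</cp:lastModifiedBy>'
            '</cp:coreProperties>')
    doc = (f'<w:document xmlns:w="{W}"><w:body><w:p>'
           '<w:ins w:id="1" w:author="Jane Example"><w:r><w:t>new</w:t></w:r></w:ins>'
           '<w:del w:id="2" w:author="Jane Example"><w:r><w:delText>old</w:delText></w:r></w:del>'
           '</w:p></w:body></w:document>')
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("docProps/core.xml", core)
        z.writestr("word/document.xml", doc)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(inspect_docx(make_sample_docx()))
```

A nonzero insertion, deletion, or comment count on a file you are about to send is a signal to go back into Word and resolve the review markup first.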
PDFs
Converting a Word doc to PDF does not reliably strip the metadata. The export carries much of it into the PDF and adds its own, including:
- Author, creator application, subject, and keywords.
- Creation and modification timestamps.
- Whether and when the PDF was digitally signed.
- XMP metadata, a standardized XML block that many PDF processors populate aggressively.
- In PDFs generated from scanned documents, OCR’d text layers that contain words you thought were only in the image.
The NSA published a public guide titled Hidden Data and Metadata in Adobe PDF Files specifically because this is such a common source of classification spills inside government. The same lessons apply to anyone sharing sensitive PDFs.
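As a rough illustration of how visible these fields can be, the following Python sketch scans the raw bytes of a PDF for common document-information keys and for an XMP packet. It is deliberately crude and rests on stated assumptions: it only sees values stored as plain literal strings in uncompressed, unencrypted objects, and a key such as /Title can also match a bookmark rather than the document title. A proper PDF library will find more. The name scan_pdf_bytes is illustrative.

```python
import re

# Keys commonly found in a PDF's document information dictionary.
INFO_KEYS = (b"Author", b"Creator", b"Producer", b"Title",
             b"Subject", b"Keywords", b"CreationDate", b"ModDate")


def scan_pdf_bytes(data):
    """Crude scan for literal-string metadata values and an XMP packet.

    Returns a dict of the first match per key, plus a 'has_xmp' flag.
    """
    found = {}
    for key in INFO_KEYS:
        # Match "/Key (value)", allowing backslash-escaped characters in value.
        pattern = rb"/" + key + rb"\s*\(((?:\\.|[^\\)])*)\)"
        m = re.search(pattern, data)
        if m:
            found[key.decode()] = m.group(1).decode("latin-1")
    # XMP is an XML packet wrapped in an <x:xmpmeta> element.
    xmp = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data, re.DOTALL)
    found["has_xmp"] = xmp is not None
    return found


if __name__ == "__main__":
    # Hypothetical fragment standing in for a real file's bytes.
    sample = (b"%PDF-1.4\n1 0 obj\n<< /Author (Jane Example) "
              b"/Creator (Microsoft Word) /ModDate (D:20240102030405Z) >>\n"
              b"endobj\n<x:xmpmeta xmlns:x='adobe:ns:meta/'></x:xmpmeta>\n")
    print(scan_pdf_bytes(sample))
```

For a real file, read it with open(path, "rb").read() and pass the bytes in. An empty result does not prove the PDF is clean, since the same fields may sit inside compressed object streams.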
Photos and images
Modern cameras and smartphones stamp a lot of information into the files they create, in a metadata format called EXIF:
- The GPS coordinates where the photo was taken, often accurate to a few meters.
- The timestamp and timezone.
- The make, model, and sometimes serial number of the camera.
- The camera settings used.
- A thumbnail that is sometimes the unedited original image — which means that a “cropped” or “blurred” photo can still contain the uncropped thumbnail inside.
For most social networks, the platform strips EXIF on upload. For files you upload yourself, share directly, or send as email attachments, EXIF goes along for the ride. When a photo needs to be shared anonymously — or just without your home address attached — strip EXIF first. Most operating systems have a built-in “remove location data” option in the file properties; open-source tools like ExifTool do the job precisely.
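To make "strip EXIF" concrete, here is a minimal Python sketch of what a stripping tool does to a JPEG: it walks the segments that precede the image data and drops any APP1 segment that carries the Exif header. Because the EXIF thumbnail lives inside that same segment, it goes too. The sketch assumes a well-formed file with no padding bytes between segments, and it leaves other metadata blocks (XMP, IPTC) untouched, so treat it as an illustration and use ExifTool for real files. The name strip_exif is illustrative.

```python
import struct


def strip_exif(jpeg):
    """Return a copy of the JPEG bytes with Exif APP1 segments removed."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(jpeg[:2])
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError("unexpected byte where a marker should be")
        marker = jpeg[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed image data follows, copy the rest as-is.
            out += jpeg[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        segment = jpeg[pos:pos + 2 + length]
        is_exif = marker == 0xE1 and segment[4:10] == b"Exif\x00\x00"
        if not is_exif:
            out += segment
        pos += 2 + length
    # No scan segment found: keep whatever bytes remain.
    out += jpeg[pos:]
    return bytes(out)
```

Typical use would be to read the photo with open(path, "rb"), pass the bytes through strip_exif, and write the result to a new file, keeping the original untouched.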
Other places metadata hides
- Printed-then-scanned PDFs. The printer’s identifier and job history can be embedded. Some color printers have for decades printed nearly invisible yellow tracking dots on every page, and those dots can survive a scan.
- Recording software. Screen recording tools sometimes embed the account email, display configuration, and installed-software list.
- Email headers. An email forwarded to someone new carries every Received: header from its entire journey, which can reveal internal mail server names, IP addresses, and past recipients.
- Spreadsheets with filters. A filter hiding rows does not remove them. The hidden rows are in the file, readable by anyone who unhides.
- Git repositories accidentally committed to public sites. The .git directory in a ZIP archive is a full history of every file that has ever been in the repository.
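For the email case in the list above, a short Python sketch using the standard library's email package shows how easily the Received: chain can be read out of a raw message. The sample message, hostnames, and addresses are made up for illustration, and received_hops is an illustrative name.

```python
from email import message_from_string


def received_hops(raw_message):
    """Return the Received: header values, newest first, with folding removed."""
    msg = message_from_string(raw_message)
    # get_all returns every header with that name, in the order they appear.
    return [" ".join(value.split()) for value in msg.get_all("Received", [])]


if __name__ == "__main__":
    # Hypothetical raw message; each mail server prepends its own Received: line.
    raw = (
        "Received: from mx.example.com (203.0.113.5)\n"
        "\tby inbox.example.net; Tue, 2 Jan 2024 10:00:05 +0000\n"
        "Received: from mail-internal.example.corp (10.0.4.17)\n"
        "\tby mx.example.com; Tue, 2 Jan 2024 10:00:00 +0000\n"
        "From: alice@example.com\n"
        "Subject: hello\n"
        "\n"
        "body\n"
    )
    for hop in received_hops(raw):
        print(hop)
```

The last entry in the list is the earliest hop, which is usually where internal server names and private IP addresses appear.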
A workflow that reduces the risk
You cannot remember to strip metadata on every file every time. What you can do is build habits that make leaks less likely:
- Default to PDF for documents you send externally, and run an inspector on the PDF before sending (Acrobat Pro’s “Sanitize Document” is the heavy-handed version; it removes nearly everything).
- Do not rely on hiding markup in Word. Switching the view to “No Markup” only hides tracked changes; they stay in the file until each one is accepted or rejected. Resolve every change, delete the comments, then run Document Inspector to confirm nothing is left.
- For photos, default to sharing via platforms that strip EXIF unless there is a reason to preserve it. When emailing a photo as an attachment, strip EXIF first.
- Before redacting anything, ask: could this content be under the visible layer? If yes, use a real redaction tool and flatten the output.
- When in doubt, print to a new PDF. Many “print to PDF” exports discard structured metadata because they regenerate the file from scratch. It is a crude method, but it usually works as a last resort.
Metadata is one of the few areas where a small amount of attention gets you a very large amount of safety. The cost is thirty seconds per sensitive file. The embarrassment of leaking something you thought was redacted is considerably more than that.