Handling Redacted Information

When publishing documents and images on the web, it's crucial to understand that information invisible to the human eye might still be accessible to search engines. This includes content you intended to redact but didn't remove completely. Let's explore why this happens and how to prevent sensitive information from appearing in search results.

Why "Hidden" Information is Still Discoverable

Several factors contribute to this phenomenon:

  • Search engines index more than meets the eye: Search engines crawl websites and index not just the visible content but also underlying data like metadata, image data, and even the text within images.

  • Assistive technologies expose hidden content: Screen readers work from a document's underlying text rather than its visual rendering, so they can read aloud text that has been hidden only by visual means.

  • Image analysis extracts hidden text: Optical Character Recognition (OCR) technology can often extract text that is present in an image's pixels, even when it is small or low in contrast. An overlay stored as a separate layer or object does not remove the pixels beneath it.

Common Redaction Mistakes

While these methods might appear to hide information, they don't effectively redact it from search engines:

  • Tiny Font: Shrinking text to a size barely perceptible doesn't remove it; it simply makes it harder to see without specialized tools.

  • Matching Font and Background Color: Using the same color for text and background might render the text invisible to the naked eye, but search engines will still recognize and index it.

  • Covering Text with Images: Placing an image over text might create the illusion of redaction, but the underlying text remains embedded in the document and accessible to search engine crawlers.
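
To illustrate why visual tricks fail, here is a minimal sketch of a text extractor that, like a simple crawler, reads text nodes and ignores styling. The sample page, the account number, and the email address are made up for illustration, and real search engine crawlers are far more sophisticated than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects every text node and ignores all styling."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return all text found in the markup, visible or not."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page that uses a tiny font and matching colors to "hide" text.
page = """
<p>Quarterly summary for the board.</p>
<p style="font-size:1px">Account number 0000-1111</p>
<p style="color:#fff;background:#fff">Contact: jane@example.com</p>
"""

print(extract_text(page))
```

The extractor returns the "hidden" account number and email address alongside the visible sentence, because the styling changes only how the text is drawn, not whether it is in the file.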

Document-Specific Challenges

Beyond these common pitfalls, different document formats pose unique challenges:

  • Change History: Many document formats retain a history of edits, potentially exposing previously redacted or altered information.

  • Image Cropping: Cropping an image might seem like a sufficient redaction method, but the full, uncropped version could still reside within the document's data.

  • Metadata: Metadata, often invisible to users, can contain sensitive information like author names, revision dates, and even the names of individuals who accessed or edited the file.

These issues can persist when a document is converted or exported to a different format, so check the exported file rather than assuming the conversion removed them.
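
As one way to check for these leftovers, the sketch below inspects a Word .docx file, which is a ZIP archive of XML parts. It assumes the standard Office Open XML layout (docProps/core.xml for author fields, w:delText elements for tracked deletions, word/media/ for embedded images). The function name inspect_docx and the report keys are hypothetical, and other formats need different checks.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Standard Office Open XML namespaces.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
}


def inspect_docx(data: bytes) -> dict:
    """Report author metadata, tracked deletions, and embedded media in a .docx."""
    report = {"metadata": {}, "tracked_deletions": [], "embedded_media": []}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()

        # Embedded images are stored as separate files; a picture cropped
        # inside the document may still be stored here uncropped.
        report["embedded_media"] = [n for n in names if n.startswith("word/media/")]

        if "docProps/core.xml" in names:
            core = ET.fromstring(archive.read("docProps/core.xml"))
            for field in ("dc:creator", "cp:lastModifiedBy"):
                element = core.find(f".//{field}", NS)
                if element is not None and element.text:
                    report["metadata"][field] = element.text

        if "word/document.xml" in names:
            body = ET.fromstring(archive.read("word/document.xml"))
            # Text deleted while change tracking was on is kept in w:delText.
            report["tracked_deletions"] = [
                el.text for el in body.iterfind(".//w:delText", NS) if el.text
            ]
    return report
```

Calling inspect_docx(open("draft.docx", "rb").read()) on a hypothetical file would return a dictionary to review before publishing. An empty report does not prove the file is clean; it only means these three checks found nothing.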

Best Practices for Effective Redaction

To ensure information remains truly redacted, follow these best practices:

1. Image Redaction Before Embedding

Problem: Redacting images after embedding them into documents often leads to incomplete redaction, as the original, unredacted image might still be stored within the document.

Solution:

  • Crop Out Sensitive Information: Before embedding an image, use an image editor to crop out any unwanted elements.

    • Example: Imagine a photograph of a confidential document. Instead of simply placing a black box over sensitive text within the document, crop the image so only the non-sensitive portions remain.

  • Obscure or Remove Remaining Text: For text that cannot be cropped out, completely remove or obscure it using image editing tools.

    • Example: If the image contains a street sign with an address you need to redact, use the clone stamp tool or other editing features to remove the address entirely, replacing it with a seamless continuation of the background.

  • Remove Metadata: Before saving the edited image, remove any potentially sensitive metadata. Most image editors have options for removing or scrubbing metadata.

  • Export Format: Save redacted images in flattened raster formats like PNG or WEBP. These formats store a single layer of pixels, so hidden layers and editable objects from the original are not carried over. They can still carry metadata, so remove it as described above.
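
For the metadata step, most image editors offer an export option, but the idea can also be sketched directly. The example below is a minimal, standard-library-only sketch for PNG files: it copies the file chunk by chunk and drops the textual, EXIF, and timestamp chunks, along with any bytes that trail the final IEND chunk. The function name strip_png_metadata is hypothetical, the sketch does not validate checksums, and it does not touch the pixels, so cropping and obscuring must already be done. JPEG and WEBP store metadata differently and need other handling.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Ancillary PNG chunks that hold text, EXIF data, or modification time.
METADATA_CHUNKS = {b"tEXt", b"zTXt", b"iTXt", b"eXIf", b"tIME"}


def strip_png_metadata(data: bytes) -> bytes:
    """Return a copy of a PNG without metadata chunks or trailing bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")

    out = bytearray(PNG_SIGNATURE)
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte length, 4-byte type, data, 4-byte CRC.
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 12 + length
        if chunk_type not in METADATA_CHUNKS:
            out += data[pos:end]
        pos = end
        if chunk_type == b"IEND":
            break  # anything stored after IEND is not copied
    return bytes(out)
```

Dropping bytes after IEND is a deliberate choice here: leftover data from a larger original image can sit after the end marker without being displayed.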

2. Thorough Text Redaction

Problem: Simply hiding text visually doesn't remove it from the underlying document, leaving it discoverable by search engines.

Solution:

  • Use Proper Redaction Tools: Avoid using black rectangles or other visual techniques to cover up text. Employ dedicated document redaction tools that completely remove the underlying text data.

    • Example: Instead of drawing a black box over sensitive text in a PDF, use a PDF redaction tool that deletes the selected text from the file itself, so it can no longer be selected, copied, or extracted.

  • Double-Check Metadata: After redacting text, carefully review the document's metadata to ensure no sensitive information remains.

3. Additional Considerations

  • URLs and File Names: Avoid using sensitive information like email addresses or names in URLs and file names. Even if content is blocked from indexing, URLs themselves might still appear in search results. Consider using generic names and hashes instead.

    • Example: Instead of naming a file "confidential_report_john.doe.pdf," opt for a generic name like "report_12345.pdf."

  • Access Control: Limit access to sensitive documents using authentication methods. For extra security, add a "noindex" robots meta tag to the login page to prevent it from being indexed.

  • Google Search Console: Verify your website in Google Search Console. This provides access to tools like the Removals tool, which allows for faster removal of accidentally published sensitive content.
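
The file-naming advice above can be sketched as follows. The function name published_name, the "report" prefix, and the 12-character digest length are illustrative choices, not part of any tool mentioned here.

```python
import hashlib
from pathlib import PurePath


def published_name(original_name: str, content: bytes, prefix: str = "report") -> str:
    """Build a generic file name from a hash of the file's content.

    Only the extension of the original name is kept; the rest is discarded.
    """
    extension = PurePath(original_name).suffix.lower()
    digest = hashlib.sha256(content).hexdigest()[:12]
    return f"{prefix}_{digest}{extension}"


# Hypothetical usage with placeholder file content.
print(published_name("confidential_report_john.doe.pdf", b"example file bytes"))
```

The hash is taken over the file's content rather than its original name, because a hash of a guessable name (such as a person's name) could be reproduced by anyone who tries likely candidates.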

4. Dealing with Indexed Sensitive Content

If unredacted or improperly redacted documents appear in search results:

  • Remove the Document: Immediately take down the document from your website.

  • Use the Removals Tool: In Google Search Console, use the Removals tool to request the removal of specific URLs or URL prefixes from search results. These removals are temporary, so the document itself must also be taken down or replaced.

  • Host the Redacted Version Under a New URL: Once you've properly redacted the document, publish it under a different URL. This ensures search engines index the updated, redacted version.

  • Contact Other Websites: If other sites host the sensitive document, request they remove it as well.

  • Consider the Outdated Content Tool: Use the Outdated Content Tool to notify Google about outdated information, including improperly redacted content on external websites.

By understanding how search engines index content and following these best practices, you can significantly reduce the risk of unintentionally exposing sensitive information.
