
Handling Redacted Information

Keep Redacted Information Out of Google Search

When publishing documents and images on the web, it's crucial to understand that information invisible to the human eye might still be accessible to search engines. This includes content you intended to redact but didn't remove completely. Let's explore why this happens and how to prevent sensitive information from appearing in search results.

Why "Hidden" Information is Still Discoverable

Several factors contribute to this phenomenon:

  • Search engines index more than meets the eye: Search engines crawl websites and index not just the visible content but also underlying data like metadata, image data, and even the text within images.

  • Assistive technologies expose hidden content: Screen readers work from the underlying text of a page or document rather than its visual rendering, so they can read aloud text that has only been hidden visually.

  • Image analysis extracts hidden text: Optical Character Recognition (OCR) technology can identify and extract text embedded within images, often even when the text is very small, low in contrast, or only partially obscured by an overlay.

Common Redaction Mistakes

While these methods might appear to hide information, they don't effectively redact it from search engines:

  • Tiny Font: Shrinking text to a size barely perceptible doesn't remove it; it simply makes it harder to see without specialized tools.

  • Matching Font and Background Color: Using the same color for text and background might render the text invisible to the naked eye, but search engines will still recognize and index it.

  • Covering Text with Images: Placing an image over text might create the illusion of redaction, but the underlying text remains embedded in the document and accessible to search engine crawlers.
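
To see why visual tricks fail, it can help to look at a page the way a text extractor does. The following is a minimal sketch using only Python's standard `html.parser` module; the sample page and the account number in it are made up for illustration, and real crawlers are far more sophisticated than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects every text node and ignores all styling attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return all text found in the markup, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the second paragraph is "hidden" with white-on-white
# text and a 1px font size, but it is still present in the markup.
page = """
<p>Quarterly summary for the finance team.</p>
<p style="color:#fff;background:#fff;font-size:1px">Account 0000-1111</p>
"""

print(extract_text(page))
```

The extractor never looks at the `style` attribute, so the "invisible" paragraph comes out alongside the visible one.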

Document-Specific Challenges

Beyond these common pitfalls, different document formats pose unique challenges:

  • Change History: Many document formats retain a history of edits, potentially exposing previously redacted or altered information.

  • Image Cropping: Cropping an image might seem like a sufficient redaction method, but the full, uncropped version could still reside within the document's data.

  • Metadata: Metadata, often invisible to users, can contain sensitive information like author names, revision dates, and even the names of individuals who accessed or edited the file.

These issues can persist even after a document is converted or exported to a different format.
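
As an illustration of how much can be left behind, the sketch below inspects a `.docx` file, which is a ZIP archive of XML parts. It assumes the conventional part names (`docProps/core.xml`, `word/document.xml`) and the usual `dc:`, `cp:`, and `w:` namespace prefixes, and it uses regular expressions rather than a full XML parser, so treat it as a rough check rather than a complete audit. The function name `inspect_docx` is hypothetical.

```python
import re
import zipfile


def inspect_docx(source):
    """Report author metadata and tracked-deletion text in a .docx file.

    `source` is a file path or a binary file-like object.
    """
    findings = {"metadata": {}, "deleted_text": []}
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())

        # Core properties usually hold the author and last editor.
        if "docProps/core.xml" in names:
            core = archive.read("docProps/core.xml").decode("utf-8", "replace")
            for field in ("dc:creator", "cp:lastModifiedBy"):
                match = re.search(rf"<{field}>(.*?)</{field}>", core, re.S)
                if match:
                    findings["metadata"][field] = match.group(1)

        # Tracked deletions keep the removed text in <w:delText> elements.
        if "word/document.xml" in names:
            body = archive.read("word/document.xml").decode("utf-8", "replace")
            findings["deleted_text"] = re.findall(
                r"<w:delText[^>]*>(.*?)</w:delText>", body, re.S
            )
    return findings
```

If either the metadata or the deleted text list is non-empty, the file still carries information that a visual review of the document would not show.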

Best Practices for Effective Redaction

To ensure information remains truly redacted, follow these best practices:

1. Image Redaction Before Embedding

Problem: Redacting images after embedding them into documents often leads to incomplete redaction, as the original, unredacted image might still be stored within the document.

Solution:

  • Crop Out Sensitive Information: Before embedding an image, use an image editor to crop out any unwanted elements.

    • Example: Imagine a photograph of a confidential document. Instead of simply placing a black box over sensitive text within the document, crop the image so only the non-sensitive portions remain.

  • Obscure or Remove Remaining Text: For text that cannot be cropped out, completely remove or obscure it using image editing tools.

    • Example: If the image contains a street sign with an address you need to redact, use the clone stamp tool or other editing features to remove the address entirely, replacing it with a seamless continuation of the background.

  • Remove Metadata: Before saving the edited image, remove any potentially sensitive metadata. Most image editors have options for removing or scrubbing metadata.

  • Export Format: Save redacted images in a flattened, non-vector format such as PNG or WEBP. These formats do not carry hidden layers or editable objects, although they can still hold metadata, so strip metadata as well.
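
Metadata removal can also be scripted. The sketch below walks the chunk structure of a PNG file and drops a handful of ancillary chunk types that commonly carry text or camera metadata. The chunk list is an assumption chosen for illustration and is not exhaustive, and the function only touches metadata: it does nothing about sensitive content in the pixels themselves.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Ancillary chunk types that commonly carry text, EXIF data, or timestamps.
# This set is an illustrative assumption, not a complete list.
METADATA_CHUNKS = {b"tEXt", b"zTXt", b"iTXt", b"eXIf", b"tIME"}


def strip_png_metadata(data):
    """Return a copy of the PNG bytes without the chunks in METADATA_CHUNKS."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    out = [PNG_SIGNATURE]
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        # Each chunk is: 4-byte length, 4-byte type, data, 4-byte CRC.
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 12 + length
        if chunk_type not in METADATA_CHUNKS:
            out.append(data[pos:end])
        pos = end
    return b"".join(out)
```

Because whole chunks are copied or skipped unchanged, the remaining chunks keep their original checksums and the image data is not re-encoded.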

2. Thorough Text Redaction

Problem: Simply hiding text visually doesn't remove it from the underlying document, leaving it discoverable by search engines.

Solution:

  • Use Proper Redaction Tools: Avoid using black rectangles or other visual techniques to cover up text. Employ dedicated document redaction tools that completely remove the underlying text data.

    • Example: Instead of drawing a black box over sensitive text in a PDF, use a PDF redaction tool that deletes the selected text from the file's content rather than just covering it.

  • Double-Check Metadata: After redacting text, carefully review the document's metadata to ensure no sensitive information remains.

3. Additional Considerations

  • URLs and File Names: Avoid using sensitive information like email addresses or names in URLs and file names. Even if a page is blocked from crawling (for example, by robots.txt), its URL might still appear in search results. Consider using generic names and hashes instead.

    • Example: Instead of naming a file "confidential_report_john.doe.pdf," opt for a generic name like "report_12345.pdf."

  • Access Control: Limit access to sensitive documents with authentication. To keep the login page itself out of search results, add a "noindex" robots meta tag to it.

  • Google Search Console: Verify your website in Google Search Console. This provides access to tools like the Removals tool, which allows for faster removal of accidentally published sensitive content.
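
One way to produce generic file names is to derive a short token from the original name. The sketch below uses a keyed hash (HMAC) rather than a plain hash, on the assumption that a plain hash of a predictable name could be reproduced by guessing. The function name, the `report` prefix, and the key are all hypothetical.

```python
import hashlib
import hmac
from pathlib import PurePosixPath


def public_file_name(original_name, secret_key, prefix="report"):
    """Build a generic file name that does not reveal the original one.

    The token is an HMAC-SHA256 of the original name, truncated to 12 hex
    characters; the original file extension is kept.
    """
    digest = hmac.new(secret_key, original_name.encode("utf-8"), hashlib.sha256)
    token = digest.hexdigest()[:12]
    suffix = PurePosixPath(original_name).suffix.lower()
    return f"{prefix}_{token}{suffix}"


# Hypothetical key; in practice it would come from a secret store.
key = b"replace-with-a-real-secret"
print(public_file_name("confidential_report_john.doe.pdf", key))
```

The same input and key always give the same name, which keeps links stable, while the published name contains nothing from the original beyond its extension.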

4. Dealing with Indexed Sensitive Content

If unredacted or improperly redacted documents appear in search results:

  • Remove the Document: Immediately take down the document from your website.

  • Use the Removals Tool: In Google Search Console, use the Removals tool to request the removal of specific URLs or URL prefixes from search results.

  • Host the Redacted Version Under a New URL: Once you've properly redacted the document, publish it under a different URL. A new URL helps search engines treat the redacted file as a new document instead of continuing to show what they stored for the old one.

  • Contact Other Websites: If other sites host the sensitive document, request they remove it as well.

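  • Consider the Refresh Outdated Content Tool: After another site has removed or changed the document, use the Refresh Outdated Content tool to ask Google to update its results for that page, since the Removals tool only works for sites you have verified in Search Console.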
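
A Removals request in Search Console is temporary, so the old URL also has to stop serving the unredacted file. The sketch below is a rough, offline check of a response you have already fetched; the function name is hypothetical, and it only covers three common signals (a 404 or 410 status, or a `noindex` value in the `X-Robots-Tag` header).

```python
def removal_looks_permanent(status_code, headers):
    """Rough check that a taken-down URL no longer serves indexable content.

    `headers` is a mapping of response header names to values.
    """
    # A 404 or 410 response tells crawlers the document is gone.
    if status_code in (404, 410):
        return True
    # Otherwise look for a noindex directive in the X-Robots-Tag header.
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    return False


print(removal_looks_permanent(410, {}))
print(removal_looks_permanent(200, {"Content-Type": "application/pdf"}))
```

A result of `False` for the old URL suggests the original file is still being served there and could be indexed again once the temporary removal expires.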

By understanding how search engines index content and following these best practices, you can significantly reduce the risk of unintentionally exposing sensitive information.


Last updated 11 months ago