Which AI Bots Are Crawling Your WordPress Site — And What Each One Actually Does

By Jasper Frumau WordPress

Pull up your WordPress site’s access logs and you’ll find something that wasn’t there two years ago: a parade of AI crawlers working through your pages. GPTBot, ClaudeBot, Amazonbot, meta-externalagent — dozens of automated requests a day from AI companies building knowledge bases, search indexes, and training datasets. We’ve been tracking this on imagewize.com since late 2025, and the numbers have grown every month.

But not all AI bots are doing the same thing. There’s an important distinction between a bot that’s training a model, one that’s building a search index for AI-generated answers, and one that fires because a real person just asked an AI assistant about your business. Knowing which is which changes how you should think about AI visibility — and what you should do about it.

Quick Summary: Over 8 weeks of server logs, AI crawlers accounted for 1–11% of imagewize.com traffic. Amazonbot is the single largest AI crawler by request volume — nearly half of all AI traffic — yet most WordPress site owners have never heard of it. This post breaks down what each bot does, why Amazonbot is crawling so hard, and what it means for your site’s AI search visibility.

What We Found in 8 Weeks of Server Logs

We monitor Nginx access logs weekly using an open-source bash script, ai-bot-monitor.sh, that identifies AI crawler user agents and reports per-bot request counts, bandwidth, scraped pages, and robots.txt compliance. It’s part of our WP Ops toolkit on GitHub. Here’s the 24-hour snapshot from June 26, 2026 — a typical day, not a spike:

Crawler              Company      Requests (24h)   Bandwidth
Amazonbot            Amazon       242              4.0 MB
ClaudeBot            Anthropic    80               1.6 MB
meta-externalagent   Meta         55               6.2 MB
OpenAI SearchBot     OpenAI       33               0.5 MB
ChatGPT-User         OpenAI       30               0.8 MB
Bytespider           ByteDance    24               0.5 MB
GPTBot               OpenAI       14               0.1 MB
YouBot               You.com      9                0.1 MB
PerplexityBot        Perplexity   4                0.1 MB
Total                             491              14 MB

That’s 5% of all requests and 10% of bandwidth on a single ordinary day. Across 8 weeks of monitoring, AI crawlers have accounted for between 1.3% and 11.1% of total traffic, with the trend slowly climbing. Meta’s crawler makes only 55 requests but consumes 6.2 MB, more bandwidth than any other bot in the table. It fetches full page content, not just API endpoints, which is why it punches above its weight on bandwidth. We covered the initial findings back in April, and the landscape has shifted meaningfully since then.
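Per-bot request and bandwidth figures like these can be tallied straight from the access log. The following is a minimal Python sketch, not the ai-bot-monitor.sh implementation. It assumes Nginx's default "combined" log format; the bot list is an illustrative subset, and the sample lines (documentation IP addresses, paths, byte counts) are made up for the example.

```python
import re
from collections import defaultdict

# User-agent substrings to look for (assumption: an illustrative subset,
# not the list that ai-bot-monitor.sh uses).
AI_BOTS = ["GPTBot", "ClaudeBot", "Amazonbot", "meta-externalagent",
           "ChatGPT-User", "Bytespider", "PerplexityBot", "YouBot"]

# Nginx "combined" format:
# ip - user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def tally(lines):
    """Return (per-bot stats, number of parsed requests)."""
    stats = defaultdict(lambda: {"requests": 0, "bytes": 0})
    total = 0
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not in combined format
        total += 1
        agent = match["agent"].lower()
        for bot in AI_BOTS:
            if bot.lower() in agent:
                size = match["bytes"]
                stats[bot]["requests"] += 1
                stats[bot]["bytes"] += 0 if size == "-" else int(size)
                break  # attribute each request to one bot only
    return dict(stats), total

# Made-up sample lines, for illustration only.
SAMPLE = [
    '192.0.2.10 - - [01/Jan/2026:10:00:00 +0000] "GET /services/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Amazonbot)"',
    '192.0.2.10 - - [01/Jan/2026:10:00:05 +0000] "GET /about/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Amazonbot)"',
    '192.0.2.20 - - [01/Jan/2026:10:01:00 +0000] "GET /robots.txt HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; GPTBot)"',
    '192.0.2.30 - - [01/Jan/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

stats, total = tally(SAMPLE)
ai_requests = sum(s["requests"] for s in stats.values())
print(f"AI share of requests: {ai_requests / total:.0%}")  # prints: AI share of requests: 75%
```

On a real server you would pass an open log file instead of SAMPLE, for example `tally(open("/path/to/access.log"))`. The same division gives the shares quoted in this post: 242 Amazonbot requests out of 491 AI requests is about 49%.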

Not All AI Bots Are the Same

The most useful framework for thinking about AI crawlers is to ask: what is this bot actually building? There are three distinct categories, and they have different implications for your site.

Training Crawlers: Building Tomorrow’s AI Models

These bots harvest content to train large language models. The data they collect today becomes part of a model that might be released in 12–24 months. Being crawled by them doesn’t make you appear in today’s AI answers — it influences what future models know.

  • GPTBot — OpenAI’s training crawler. Feeds future GPT model versions. You can block it in robots.txt with User-agent: GPTBot / Disallow: / if you don’t want your content used for training.
  • ClaudeBot — Anthropic’s training crawler. Appears in our logs as the #2 crawler by volume, though some of those hits are user-initiated (explained below). Blocking it prevents your content from training future Claude models.
  • Bytespider — ByteDance (TikTok’s parent company). Powers their AI features. To block it in robots.txt, use User-agent: Bytespider / Disallow: /.
  • Google-Extended — Specifically for Gemini model training, separate from Googlebot. Started appearing in our logs in June.
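If you do decide to block training crawlers, it helps to check that the rules say what you think they say before deploying them. Here is a small sketch using Python's standard-library robots.txt parser. The robots.txt content and the example.com URL are hypothetical, and a well-behaved crawler honoring the file is an assumption this check cannot verify.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block three training crawlers, allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(bot, url="https://example.com/services/"):
    """True if the parsed rules let `bot` fetch `url`."""
    return parser.can_fetch(bot, url)

for bot in ["GPTBot", "ClaudeBot", "Bytespider", "Amazonbot", "CCBot"]:
    print(f"{bot}: {'allowed' if allowed(bot) else 'blocked'}")
```

Crawlers without a group of their own (Amazonbot and CCBot here) fall through to the `*` group and stay allowed.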

Search Index Crawlers: Powering AI-Generated Answers Right Now

These bots are building real-time or near-real-time knowledge bases that surface your content in today’s AI search answers. Being well-crawled by them means a higher chance of appearing when someone asks Perplexity, ChatGPT Search, or Amazon Rufus a question that relates to your business. This is the category that matters most for AI search visibility.

  • OpenAI SearchBot — Crawls for ChatGPT’s web search feature. If a ChatGPT user enables web browsing and asks a question in your niche, this index is what gets searched. We’ve seen it grow from 3–13 requests/day in April to 30–42/day by late June — a clear ramp-up.
  • PerplexityBot — Surprisingly quiet at just 4 requests in 24 hours, considering Perplexity is a major AI search product. It crawls selectively and relies partly on Bing’s index.
  • Amazonbot — More on this below. Amazon’s crawler powers Rufus (the AI shopping assistant in the Amazon app), Alexa+, and Amazon Bedrock’s grounding features. It’s the single biggest AI crawler we see.
  • YouBot — You.com’s AI search crawler. Started appearing in our logs in May and has grown to 9–18 requests/day.
  • meta-externalagent — Meta AI’s search crawler. Powers the AI assistant in WhatsApp, Instagram, and Facebook. Fetches full page content, which is why it consumes disproportionate bandwidth relative to its request count.

Real-Time User Fetches: Someone Just Asked About You

This category is the most interesting and least understood. These are requests that fire because a real human being, right now, asked an AI assistant about something and the AI went to look at your page in real time.

  • ChatGPT-User — When a ChatGPT user shares a URL with the chat (“summarize this page for me”) or asks a question that triggers a live web fetch, the request appears in your logs as ChatGPT-User. We got 30 of these in 24 hours — each one represents a real person using ChatGPT to access content on imagewize.com.
  • anthropic-ai — The equivalent for Claude users who share URLs in conversation. Separate from ClaudeBot (the autonomous crawler), though they share Anthropic’s IP ranges, which can make them hard to distinguish in raw log analysis.
  • ClaudeBot (user-triggered variant) — Here’s the ambiguous one. When ClaudeBot visits your site, it could be autonomous crawling or it could be a Claude user who asked Claude to look at your page. We use Claude ourselves and reference imagewize.com pages during development, which means some of our 80 ClaudeBot hits are self-generated. The original post from April covered this in detail.
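The three categories can be applied to log data with a simple lookup. The sketch below is one way to do it in Python, under a few assumptions: the grouping follows this article's classification, "OAI-SearchBot" is used as the user-agent token for what the article calls OpenAI SearchBot, and ClaudeBot is filed under training even though some of its hits may be user-triggered.

```python
from collections import Counter

# Grouping taken from the three categories described above.
# ClaudeBot is ambiguous (some hits may be user-triggered); it is
# counted as a training crawler here for simplicity.
CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot", "Bytespider", "Google-Extended"],
    "search_index": ["OAI-SearchBot", "PerplexityBot", "Amazonbot",
                     "YouBot", "meta-externalagent"],
    "user_fetch": ["ChatGPT-User", "anthropic-ai"],
}

def classify(user_agent):
    """Return the category name for a user-agent string, or None."""
    agent = user_agent.lower()
    for category, tokens in CATEGORIES.items():
        if any(token.lower() in agent for token in tokens):
            return category
    return None

def category_counts(user_agents):
    """Count requests per category, ignoring unrecognized agents."""
    return Counter(c for c in map(classify, user_agents) if c)

# Made-up user-agent strings for illustration.
agents = [
    "Mozilla/5.0 (compatible; GPTBot)",
    "Mozilla/5.0 (compatible; ChatGPT-User)",
    "Mozilla/5.0 (compatible; Amazonbot)",
    "Mozilla/5.0 (compatible; Amazonbot)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]
print(category_counts(agents))
```

Splitting counts this way makes it easier to see how much of your AI traffic is training-related and how much reflects search indexing or live user requests.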

Amazonbot: The One That Surprised Us

In our 24-hour snapshot, Amazonbot made 242 requests. That’s 49% of all AI crawler traffic — nearly half — from a bot that most WordPress tutorials don’t even mention. It’s been the top AI crawler in every single report we’ve run since April.

Why is Amazon crawling so aggressively? A few reasons:

  • Rufus — Amazon’s AI shopping assistant, built into the Amazon app, uses Amazonbot to understand what products and services exist beyond Amazon’s own catalog. If someone asks Rufus “what’s a good WordPress hosting service for my small business,” Rufus needs to know what’s out there. Imagewize sells managed WordPress hosting. That’s a directly relevant query.
  • Alexa+ — The AI-upgraded version of Alexa uses live web data to answer questions, not just a static knowledge base.
  • Amazon Bedrock — Amazon’s enterprise AI platform uses web grounding (live retrieval from the web) to give its models current information. Sites that Amazonbot can freely access get included in that retrieval pool.

For a small business, being well-indexed by Amazonbot is increasingly meaningful — especially if you sell services that Amazon customers might research. We offer managed WordPress hosting and WordPress SEO — both directly relevant to queries Amazon’s AI products would field. The crawl volume suggests Amazon is building something ambitious with its AI products, and they’re in active data-gathering mode.

What About Mistral and DeepSeek?

Two names are conspicuously missing from the table above: Mistral and DeepSeek, two of the most talked-about AI labs outside the American giants. We went looking for them specifically across two months of logs on two separate WordPress sites, and the results were a study in contrasts.

Mistral Isn’t Crawling at All

Mistral runs two crawlers: MistralAI-User, a live fetcher that fires when someone asks the company’s Le Chat assistant about a specific page, and MistralAI-Index, an indexing crawler for discovering content. We found exactly zero requests from either one — not on imagewize.com, not on a second WordPress site we monitor, across the full eight weeks. For a major AI lab, that absence is striking.

Mistral’s own documentation hints at why. Its robots documentation states that neither crawler feeds model training (MistralAI-Index content is “not used for generative AI training of any kind”), and the company declines to disclose its training datasets at all. So if its own crawlers aren’t gathering training data, that data has to come from somewhere else. The likely explanation is that Mistral leans heavily on Common Crawl, the open, shared web archive built by the CCBot crawler, rather than running its own broad crawl.

If that’s how Mistral sources most of its training data, then the way to reach Mistral’s models isn’t to court MistralAI-Index; it’s to make sure CCBot can access you and that your pages land in the Common Crawl archive. It’s also a reminder that a model can only answer with what it has actually read: if Mistral never crawls you directly and you’re not in Common Crawl, you are unlikely to appear in its answers.

But here’s the catch, which we can verify from the other direction: our monitor tracks CCBot, and across two months on both sites it logged just six requests total. So Common Crawl’s own crawler is nearly as quiet as Mistral’s. Pages do get into the archive, but only on CCBot’s slow, broad schedule; you can’t court it the way you can an llms.txt-aware search crawler. The practical play is baseline hygiene: keep CCBot allowed in robots.txt, serve clean server-rendered HTML, and maintain solid sitemaps and internal links so your key pages are reachable whenever it does pass through.

DeepSeek Crawls — But Watch What It Asks For

DeepSeek tells the opposite story. DeepSeekBot does show up in our logs — modest volume, a couple of dozen requests across both sites over two months, crawling exactly the pages you’d expect: service pages, the about page, the sitemap. So far, so normal.

What stood out was the company those requests kept. Mixed in among the legitimate page fetches were hits for /.env.production, /_next/build-manifest.json, and /dist/.vite/manifest.json — paths that have nothing to do with reading content. Those are the kinds of files a scanner probes to fingerprint your tech stack or hunt for leaked secrets. Every one returned a 403 or 404 on our hardened setup, so nothing leaked. And it’s worth remembering that any scanner can spoof a crawler’s user-agent string, so some of this traffic may not be DeepSeek at all. Either way the lesson holds: a request wearing an AI crawler’s name isn’t automatically benign. Give it the same scrutiny as any other bot — verify by IP where you can, and keep sensitive paths locked down.
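A quick way to surface this kind of traffic is to flag requests whose path looks like a probe rather than a page. The Python sketch below is a heuristic under stated assumptions: it treats any hidden path segment (one starting with a dot, other than .well-known) and the filename build-manifest.json as suspicious, which covers the three paths mentioned above but is not a complete scanner signature list.

```python
from urllib.parse import urlsplit

# Assumptions for this sketch: filenames and hidden directories chosen
# to match the examples in the text, not an exhaustive list.
PROBE_FILENAMES = {"build-manifest.json"}
ALLOWED_HIDDEN = {".well-known"}

def is_probe(path):
    """True if the request path looks like a stack or secret probe."""
    segments = [s for s in urlsplit(path).path.split("/") if s]
    hidden = any(s.startswith(".") and s not in ALLOWED_HIDDEN
                 for s in segments)
    known_file = bool(segments) and segments[-1] in PROBE_FILENAMES
    return hidden or known_file

def probes_by_agent(requests):
    """requests: iterable of (user_agent, path) pairs.

    Returns a dict mapping each agent to the probe paths it requested.
    """
    found = {}
    for agent, path in requests:
        if is_probe(path):
            found.setdefault(agent, []).append(path)
    return found

# Illustrative input using the paths named in the text.
requests = [
    ("DeepSeekBot", "/about/"),
    ("DeepSeekBot", "/.env.production"),
    ("DeepSeekBot", "/_next/build-manifest.json"),
    ("DeepSeekBot", "/dist/.vite/manifest.json"),
]
print(probes_by_agent(requests))
```

A hit from this check says nothing about who actually sent the request, since the user-agent string can be spoofed. It only tells you which log entries deserve a closer look.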

The broader takeaway: don’t assume the famous AI labs are all crawling you, and don’t assume that the ones that do crawl you are only reading your content. The only way to know is to read your own logs.

What This Means for Your WordPress Site

Knowing who’s crawling you is only useful if you do something with that information. Three practical things to get right:

Make Your Content Easy to Extract

Most AI crawlers don’t render JavaScript; they read the raw HTML your server returns. If your content is locked inside client-side rendering, custom Gutenberg blocks that output empty divs without JS, or page-builder layouts that rely on JS for structure, those crawlers get little or nothing useful. Stick to server-rendered content in clean semantic HTML. For WordPress specifically, that means core Gutenberg blocks, well-structured headings, and avoiding heavy JS-dependent builders for core content pages.
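One rough way to see a page the way a non-rendering crawler does is to count the words present in the raw HTML once script and style content is ignored. The following Python sketch uses only the standard library; the two HTML snippets are made-up examples, and word count is a crude proxy for "extractable content", not a measure any crawler is known to use.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text available without running JavaScript."""
    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())

def visible_word_count(html):
    """Number of words in the raw HTML outside script/style/template."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return len(parser.words)

# Made-up pages: one server-rendered, one that needs JS to show anything.
SERVER_RENDERED = (
    "<html><body><h1>Managed WordPress Hosting</h1>"
    "<p>Plans for small businesses.</p></body></html>"
)
CLIENT_RENDERED = (
    '<html><body><div id="root"></div>'
    '<script>render("Managed WordPress Hosting")</script></body></html>'
)

print(visible_word_count(SERVER_RENDERED))  # prints: 7
print(visible_word_count(CLIENT_RENDERED))  # prints: 0
```

If a key page scores close to zero when you feed it the HTML returned by a plain HTTP fetch, its content probably depends on JavaScript to appear.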

Add an llms.txt File

An emerging, still-proposed standard for AI discoverability is llms.txt: a plain-text (Markdown) file in your site root that tells AI systems what your site is about and which pages matter most. We covered the full implementation in how to add llms.txt to WordPress, including for Bedrock and Trellis setups where the site root isn’t straightforward.
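To give a sense of what the file contains, here is a short Python sketch that builds one. It assumes the layout from the llms.txt proposal (a title heading, a one-line summary as a blockquote, then sections of annotated links). The site name, URLs, and descriptions are hypothetical placeholders, not imagewize.com's actual file.

```python
def build_llms_txt(title, summary, sections):
    """Build llms.txt content.

    sections maps a section heading to a list of (name, url, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"

# Hypothetical site details for illustration.
content = build_llms_txt(
    title="Example Agency",
    summary="WordPress hosting and SEO services for small businesses.",
    sections={
        "Services": [
            ("Managed hosting", "https://example.com/hosting/",
             "Plans and pricing"),
            ("WordPress SEO", "https://example.com/seo/",
             "Audits and ongoing support"),
        ],
    },
)
print(content)
```

The resulting text would be saved as llms.txt in the web root, wherever that is for your setup.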

Add Article Schema to Blog Posts

The SEO Framework (which imagewize.com uses) does not automatically emit Article or BlogPosting schema for posts. AI crawlers that read structured data — including Google’s AI Overviews — benefit from explicit Article JSON-LD. We implemented this manually across all recent posts and documented the exact block pattern for anyone running a similar WordPress setup. It’s a one-time addition per post once you have the template.
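For readers who have not seen Article JSON-LD before, the sketch below generates a minimal BlogPosting snippet in Python. It is not the block pattern referred to above; the property selection is an assumption covering a few common schema.org fields, and the author name, date, and URL are placeholders.

```python
import json

def article_json_ld(headline, author, date_published, url):
    """Return a <script> tag containing BlogPosting JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,
        "mainEntityOfPage": url,
    }
    # Escape "</" so a value can never close the script tag early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'

# Placeholder values for illustration.
snippet = article_json_ld(
    headline="Which AI Bots Are Crawling Your WordPress Site",
    author="Example Author",
    date_published="2026-01-01",
    url="https://example.com/blog/ai-bots/",
)
print(snippet)
```

The output can be pasted into a Custom HTML block or emitted from a template, then checked with a structured-data validator.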

Frequently Asked Questions

  • What is Amazonbot and why is it crawling my WordPress site? Amazonbot is Amazon’s web crawler. It feeds Amazon’s AI products — Rufus (the AI shopping assistant in the Amazon app), Alexa+, and Amazon Bedrock. It crawls service and business sites to build the knowledge base that powers AI-generated answers within Amazon’s ecosystem. It’s the most active AI crawler we see by request volume, often accounting for 40–50% of all AI bot traffic.
  • What is the difference between GPTBot and ChatGPT-User? GPTBot is OpenAI’s autonomous training crawler — it crawls your site to collect data for future model training with no human involved. ChatGPT-User fires when a real ChatGPT user triggers a live web fetch, for example by sharing a URL and asking ChatGPT to summarize it. GPTBot builds training data; ChatGPT-User represents an actual person reading your content via ChatGPT right now.
  • Should I block AI crawlers in my WordPress robots.txt? It depends on your goal. If you don’t want your content used for model training, you can block GPTBot, ClaudeBot, and Bytespider. But blocking search-index crawlers like OpenAI SearchBot, Amazonbot, or PerplexityBot will exclude your site from AI-generated search answers — which is increasingly where attention goes. Most businesses should allow search index crawlers and use llms.txt to guide them, while optionally blocking pure training crawlers.
  • How do I check which AI bots are visiting my WordPress site? The most reliable method is reading your Nginx or Apache access logs directly. WordPress plugins don’t capture all bot traffic. On a Linux server, you can grep your access log for known AI user-agent strings (OAI-SearchBot is the token OpenAI SearchBot sends): grep -iE "GPTBot|ClaudeBot|Amazonbot|PerplexityBot|Bytespider|meta-externalagent|ChatGPT-User|OAI-SearchBot|YouBot|CCBot|MistralAI-User|MistralAI-Index|DeepSeekBot" /path/to/access.log | wc -l. For ongoing monitoring we use our open-source ai-bot-monitor.sh script (part of the WP Ops toolkit on GitHub), which identifies each crawler, tallies requests and bandwidth, lists scraped pages, and checks robots.txt compliance.
  • Does Mistral AI crawl my WordPress site? In two months of server logs across two WordPress sites, we recorded zero requests from either of Mistral’s crawlers — MistralAI-User (its live Le Chat fetcher) and MistralAI-Index (its indexing crawler). Mistral appears to rely largely on Common Crawl, the open web archive built by CCBot, rather than running its own broad crawl. But CCBot is itself a rare visitor — just six requests across both sites in those two months — so getting into the archive happens on its slow, broad schedule rather than because you courted it. The practical play is baseline hygiene: keep CCBot allowed in robots.txt, serve clean HTML, and maintain solid sitemaps and internal links so your key pages are reachable whenever it passes through.

Need WordPress SEO Support for Your Business?

We handle WordPress SEO for SMEs — from technical foundations (schema, crawlability, Core Web Vitals) to on-page optimization and content strategy. Fixed-price audits and ongoing support available.

  • Technical SEO audit and implementation
  • Schema markup and structured data
  • Core Web Vitals and page speed optimization
  • On-page SEO and content strategy
