
AI Bots Are Eating the Web — But Not Ours: What We Found Out

By Jasper Frumau, DevOps

You may have seen the headlines. “AI bots are eating the web.” The narrative is spreading fast: AI companies are quietly consuming the internet, scraping content to train models and power AI search answers, and your server is paying for it. Cloudflare’s own data shows around 31% of all web traffic is bot traffic — and AI-specific crawlers are a fast-growing slice of that. GPTBot alone jumped from 4.7% to 11.7% of all AI crawler traffic in a single year.

So we checked our own Nginx access logs.

Our Numbers: 24 Hours on imagewize.com

We pulled a 24-hour window from our production Nginx access log and ran an analysis specifically targeting known AI crawler user agents. Here is what we found.

Crawler                    Requests   Bandwidth
Anthropic ClaudeBot             119     1.57 MB
Amazon Amazonbot                106     2.09 MB
PerplexityBot                    21     0.43 MB
OpenAI ChatGPT-User              19     0.47 MB
OpenAI GPTBot                     6     0.14 MB
OpenAI SearchBot                  6         n/a
ByteDance Bytespider              5     0.03 MB
Meta meta-externalagent           2     0.10 MB

Total: 284 AI crawler requests out of 10,635 — 2.7% of traffic, 5.4% of bandwidth.

Not 50%. Not even close.
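Reproducing the headline share number is a couple of shell commands. The log lines and file path below are hypothetical stand-ins for illustration; on a real server you would point LOG at your Nginx access log.

```shell
# Hypothetical sample log; on a server, set LOG to your real Nginx access log.
LOG=/tmp/ai-share-demo.log
cat > "$LOG" <<'EOF'
3.12.0.5 - - [10/May/2025:10:00:01 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
8.8.4.4 - - [10/May/2025:10:00:02 +0000] "GET / HTTP/1.1" 200 8192 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"
20.1.2.3 - - [10/May/2025:10:00:03 +0000] "GET /themes/ HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"
EOF

# Total requests vs requests from known AI crawler user agents.
TOTAL=$(wc -l < "$LOG")
AI=$(grep -cE 'GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|Bytespider|Amazonbot|meta-externalagent|OAI-SearchBot' "$LOG")
awk -v t="$TOTAL" -v a="$AI" 'BEGIN { printf "%d of %d requests (%.1f%%)\n", a, t, a * 100 / t }'
```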

The 7-Day View

A single day can be noisy. Running the same analysis over a full week gives a more reliable picture.

Crawler                    Requests (7 days)   Bandwidth
Anthropic ClaudeBot                      676     9.66 MB
Amazon Amazonbot                         389     7.77 MB
Meta meta-externalagent                  129    12.79 MB
OpenAI ChatGPT-User                       92     2.28 MB
PerplexityBot                             93     1.85 MB
ByteDance Bytespider                      69     0.95 MB
OpenAI SearchBot                          60     0.37 MB
OpenAI GPTBot                             22     0.61 MB

Total: 1,530 AI requests out of 58,526 — 2.6% of traffic, 6.6% of bandwidth.

For context: Googlebot has historically been the dominant automated presence in most server logs. On imagewize.com it made 60 requests over a comparable 24-hour window (0.58% of traffic) against 284 from AI crawlers (2.7%). Across 7 days: 403 Googlebot requests (0.68%) versus 1,530 from AI systems (2.6%). AI crawlers as a group are now running at roughly 4× Googlebot's crawl frequency on this site and have already overtaken it by request count, which says something about how aggressively these systems are indexing the web, even at SME scale.

The daily and weekly percentages are remarkably stable at 2.6–2.7%, which suggests these crawlers run on regular schedules rather than bursty events. Two things stand out across the week: Meta’s meta-externalagent consumed the most bandwidth per request by a wide margin — 12.79 MB for only 129 requests, suggesting it is pulling large pages or following redirects aggressively. And the numbers confirm what the 24-hour window already showed: we are a long way from 50%.

That does not mean the bot traffic problem is overstated. Large, high-traffic sites attract disproportionate crawler attention. A site publishing news or documentation with millions of daily visitors is a much more interesting crawl target than an SME web agency site. If your site is in the same weight class as ours — a focused business site, a few hundred real visitors a day, a solid content library — the realistic picture is probably closer to what we see.

Who Is Actually Crawling — and Why It Matters

Not all AI crawlers are doing the same thing. The distinction matters when you are deciding whether to block them.

GPTBot is OpenAI’s training crawler — it scrapes content to improve future models. ChatGPT-User is different: it visits your pages on behalf of users asking live questions in ChatGPT’s browsing mode. If it shows up in your logs, your content is being used to generate AI answers in real time, not just stored for future training. PerplexityBot works the same way. These live browsing agents represent potential referral traffic — if a user follows a citation link from Perplexity or ChatGPT, they arrive on your site.

ClaudeBot topped our 24-hour count. Worth noting: when a user on claude.ai asks Claude to look at a page on your site, that request also comes from Anthropic’s AWS infrastructure with the ClaudeBot user agent — indistinguishable from the autonomous training crawler in the logs. We use claude.ai ourselves to reference imagewize.com pages during development, so some of those hits are almost certainly user-initiated rather than autonomous. Checking the actual IP ranges helps confirm the traffic is from Anthropic’s infrastructure (AWS Ohio and Virginia), but it cannot tell you which type of ClaudeBot request it is.
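A quick way to start that IP check is to pull the unique source addresses behind ClaudeBot requests and compare them against Anthropic's published ranges (or a whois lookup). The sample file and IPs below are hypothetical; on a server you would grep your real access log.

```shell
# Hypothetical sample; replace the demo file with your real access log.
cat > /tmp/claudebot-demo.log <<'EOF'
3.12.0.5 - - [10/May/2025:10:00:01 +0000] "GET /themes/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
3.12.0.5 - - [10/May/2025:10:05:09 +0000] "GET /blog/ HTTP/1.1" 200 7168 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
18.117.4.2 - - [10/May/2025:10:09:30 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
EOF

# Unique source IPs behind ClaudeBot hits; check these against Anthropic's
# published IP ranges (or whois) to confirm the traffic origin.
grep 'ClaudeBot' /tmp/claudebot-demo.log | awk '{ print $1 }' | sort -u
```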

What Pages Are They After?

The 7-day page breakdown is more revealing than 24 hours. The top scraped pages were:

  • /robots.txt — 33 hits, confirming most crawlers do check it before crawling
  • Our post on building an AI plugin for Elayne — top content hit by a clear margin
  • A specific technical fix post (WordPress 6.9.1 block style issue in Sage)
  • Our AI workflow transparency post
  • The /themes/ page and individual theme pages
  • Every major category page (/category/wordpress/, /category/woocommerce/, /category/seo/, etc.) — each hit exactly 4 times

The pattern is clear: AI crawlers are drawn to technical content with specific problem/solution framing and AI-adjacent topics. The category pages getting uniform hit counts (exactly 4 each) looks like a structured crawl pass — a system mapping out what a site covers before going deeper. Generic service pages and contact pages barely appeared. If you are thinking about AI citation potential, specific technical posts outperform broad landing pages.
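The per-page breakdown comes straight out of the log as well. In Nginx's default "combined" log format, field 7 is the request path. The demo file and paths below are hypothetical; on a server you would grep your real access log.

```shell
# Hypothetical sample; point at your real Nginx access log on a server.
cat > /tmp/ai-pages-demo.log <<'EOF'
1.1.1.1 - - [10/May/2025:10:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
1.1.1.1 - - [10/May/2025:10:00:02 +0000] "GET /category/wordpress/ HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
2.2.2.2 - - [10/May/2025:10:01:00 +0000] "GET /category/wordpress/ HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"
EOF

# $7 is the request path in Nginx's default "combined" log format.
grep -E 'GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot' /tmp/ai-pages-demo.log \
  | awk '{ print $7 }' | sort | uniq -c | sort -rn
```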

How to Check Your Own Logs

If you are running Nginx on a Linux server, the simplest starting point is a one-liner over SSH:

ssh user@yourserver.com "grep -oE 'GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Google-Extended|PerplexityBot|CCBot|Bytespider|Amazonbot|meta-externalagent' /path/to/access.log | sort | uniq -c | sort -rn"
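To get bandwidth as well as request counts, you can extend that into a short awk pass. This is a sketch rather than our monitoring script: field positions assume Nginx's default "combined" log format, where field 10 is body_bytes_sent, and the sample log below is hypothetical.

```shell
# Hypothetical sample; on a server, point LOG at your real access log.
LOG=/tmp/ai-bandwidth-demo.log
cat > "$LOG" <<'EOF'
3.12.0.5 - - [10/May/2025:10:00:01 +0000] "GET /blog/ HTTP/1.1" 200 1048576 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
3.12.0.5 - - [10/May/2025:10:02:07 +0000] "GET /themes/ HTTP/1.1" 200 1048576 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
20.1.2.3 - - [10/May/2025:10:03:44 +0000] "GET / HTTP/1.1" 200 524288 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"
EOF

# Per-crawler request count and bandwidth ($10 is body_bytes_sent).
awk '
  match($0, /GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|Bytespider|Amazonbot|meta-externalagent/) {
    bot = substr($0, RSTART, RLENGTH)
    reqs[bot]++
    bytes[bot] += $10
  }
  END {
    for (b in reqs)
      printf "%-16s %5d requests %8.2f MB\n", b, reqs[b], bytes[b] / 1048576
  }
' "$LOG" | sort
```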

We built a more complete analysis script — ai-bot-monitor.sh — that breaks down requests per crawler, pages scraped, bandwidth consumed, hourly distribution, robots.txt compliance, and a cross-check for operator IP ranges. It runs alongside our existing traffic and security monitoring on a weekly basis and is available in our open tooling.

Should You Block AI Crawlers?

This is a judgment call. The answer is not the same for every site.

The case for blocking

  • AI systems consume your bandwidth without paying for it
  • Your content may end up in model training data without attribution or compensation
  • If AI-generated answers replace direct clicks to your site, you lose traffic

The case for leaving it open

  • ChatGPT-User and PerplexityBot are live browsing agents — blocking them may mean your site never appears in AI-generated answers that could send you traffic
  • GPTBot has a documented opt-out via robots.txt — well-behaved crawlers respect it
  • Hard blocking in Nginx catches everyone but requires ongoing maintenance as new crawlers emerge
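For reference, that hard block is only a few lines of Nginx config. This is a hypothetical sketch, not our configuration; the map goes in the http block, and the user agent list is illustrative and needs ongoing maintenance as new crawlers appear.

```nginx
# Sketch only: the map directive belongs in the http {} block.
map $http_user_agent $ai_training_bot {
    default        0;
    ~*GPTBot       1;
    ~*CCBot        1;
    ~*Bytespider   1;
}

server {
    # ... existing server config ...

    if ($ai_training_bot) {
        return 403;
    }
}
```

Note that this catches well-behaved and badly behaved crawlers alike, which is exactly why it trades robots.txt politeness for maintenance burden.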

If you want to opt out of training crawlers specifically, the major ones have documented user agents. Add to your robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

Google-Extended is worth understanding specifically: it controls Gemini training data, but is completely separate from Googlebot. Disallowing it has zero effect on your Google Search rankings. It is one of the cleaner opt-outs available.

Take a Step Back

Once you start looking at the full bot picture, AI crawlers stop being the headline. On the same 24-hour window, AhrefsBot made 581 requests on its own — more than double all AI crawlers combined. Add in SemrushBot, MJ12bot, and the rest of the SEO audit category and you are looking at roughly 7.5% of traffic from tools that were auditing, not indexing, our site.

Then factor in PetalBot (Huawei’s search crawler at 3.8%), AwarioBot, UptimeRobot, and assorted search engines from Yandex to Sogou — total identified bots on that day came to around 20% of requests.
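One way to see that full picture at a glance is to bucket known bot user agents into rough categories. This is a hedged sketch: the patterns and category names are illustrative rather than exhaustive, and the sample log is hypothetical.

```shell
# Hypothetical sample; extend the patterns for your own logs.
cat > /tmp/bot-mix-demo.log <<'EOF'
1.1.1.1 - - [10/May/2025:10:00:01 +0000] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0)"
2.2.2.2 - - [10/May/2025:10:00:02 +0000] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0)"
3.3.3.3 - - [10/May/2025:10:00:03 +0000] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
4.4.4.4 - - [10/May/2025:10:00:04 +0000] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0 (compatible; PetalBot)"
EOF

# Tally requests per bot category; patterns are illustrative, not exhaustive.
awk '
  /AhrefsBot|SemrushBot|MJ12bot/              { n["SEO audit"]++;  next }
  /GPTBot|ClaudeBot|PerplexityBot|Amazonbot/  { n["AI"]++;         next }
  /Googlebot|PetalBot|YandexBot|Sogou/        { n["Search"]++;     next }
  /UptimeRobot|AwarioBot/                     { n["Monitoring"]++; next }
  END { for (c in n) printf "%-11s %d\n", c, n[c] }
' /tmp/bot-mix-demo.log | sort
```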

Genuine human visitors were probably closer to 60–70% of the total. AI crawlers are the new entrants getting all the attention, but they are a fraction of a bot ecosystem that has been running quietly in your logs for years.

Our Current Approach

We are watching and not blocking. The traffic volume is low, the bandwidth cost is negligible at our scale, and at least some of these crawlers (ChatGPT-User, PerplexityBot) bring real-time referral potential. We will revisit if volume spikes or if CCBot starts appearing aggressively — Common Crawl feeds a large number of open-source and commercial AI training datasets and was notably absent from our logs this week.

The more interesting question is whether AI-generated answers are reducing organic traffic from humans. That is harder to measure from server logs. Checking Google Search Console for impressions trends on informational queries is probably more revealing for that specific concern.

Numbers first. Decisions second.

What does this mean for your site?

If you have been worried about AI bots destroying your performance or traffic, the data — at least at SME scale — suggests the concern is currently manageable. Monitor what is hitting your server, know the bandwidth cost, understand which pages are targeted, and be ready to act if the picture changes.

We handle this kind of server-level analysis as part of ongoing WordPress maintenance for our clients. If you want eyes on your own logs — or a properly configured server that keeps the right things in and the wrong things out — get in touch.

Post social media image by ThisIsEngineering: https://www.pexels.com/photo/code-projected-over-woman-3861969/
