Unblock Google Analytics: Prevent AdBlockers from Blocking Site Analytics

In a Nutshell

I show you how I (used to) un-adblock Google Analytics to get full visitor insights on my pages through some zany techniques. I thought about patenting these, but it’s more fun to share. I can defeat my own techniques, and the community will likely catch up. Until then, we can enjoy visitor analytics despite regex-based adblockers and DNS blockers (e.g. Pi-Hole).


Goal: Prevent Google Tag Manager and Google Analytics from being blocked by adblockers and DNS blockers using URL-encoding, same-domain proxying, and WAF-bypass techniques. Get nearly 100% of visitor analytics and stop losing 25~30% of visitor data to adblockers and DNS blockers.
Update: The community has come together and built some tailor-made blocking filters just for me. Some of them are pretty clever. The knowledge in this post had a good six-month run. I’ll let the adblock filters stand and not try to circumvent them with script-inlining. I’m glad the community of adblockers exists. Please enjoy the read as it originally was, and the update on how they blocked my scripts.

Consider the following network traffic:

Google Tag Manager is blocked

I use adblockers in the browser and at the DNS level with Pi-Hole and pfSense. Millions of people do as well because ads are obnoxious and even malicious.

However, I’m amenable to a website collecting anonymous visitor behaviour information with Google Tag Manager. I use it. But, it’s blocked in uBlock and pfSense as a well-known tracking domain. I’ll explore how to prevent Google Tag Manager from being blocked.


Obviously, we need to cloak or hide the googletagmanager.com domain (and more). But,

My websites are serverless, so server-side scripts cost money per execution1.

I cannot host a server-side script in PHP or ASPX or Node because I do not have a server. What are some ideas to unblock Google Tag Manager and Google Analytics?

Replace the Domain with an IPv4?

We can replace the domain with the IPv4 address in theory, but that only bypasses regex-based browser blockers and DNS-lookup blockers; pfSense and Pi-Hole also use community lists that block IP addresses in a game of whack-a-mole. Also, that domain may have many, many A and AAAA records that change.

Problem: Google uses virtual hosting on their IPv4, so a Host header needs to be set which is impossible (for the bulk of visitors) since 2015. Also, HTTPS will fail because an IP address is not in the certificate SAN.

Replace the Domain with an IPv6?

In North America, virtually everyone I know has IPv6 disabled in their router, PC, Linux server, you name it. While replacing the domain with an IPv6 address would bypass the IPv4 community blocklists, requests from visitors without IPv6 would simply fail, so GTM would effectively be blocked anyway.

Problem: Too many people have IPv6 turned off.

Copy and Paste the Core Google Tag Manager JavaScript?

I like your thinking. However, even though the GTM JavaScript doesn’t change very often, it does change. Also, the script needs to phone home via a WebSocket or invisible pixel or XHR request or iframe. Those calls home will likely be blocked as they are obvious in plaintext. Replacing the domains with the IPs has the same problems as the previous ideas.

Problem: The problem hasn’t changed: the blocked domains like google-analytics.com are still in the client-side code.

Proxy the Google Tag Manager Communication Through a Worker?

Preprocessing the GTM and Analytics scripts server-side in an AWS Lambda or Cloudflare Worker would allow the whole Google Tag Manager core to be fetched, deobfuscated, and have offending domains replaced with Lambda endpoints that proxy the requests to Google (via XHR or invisible pixels, not WebSockets). The co-opted GTM code would then be cached and sent client-side. No one would be the wiser.

Cost: Each interaction with Google Tag Manager would cost one Worker invocation. For example, if you scroll the page, that might be four XHR calls. Loading and unloading the page incurs another two calls. With Cloudflare Workers, you get 100,000 invocations per day across all your sites. This will be exhausted faster than you realize.
Shared Hosting: This server-side proxying is trivial to implement on a shared host with PHP and ASP scripts if you are still on a slow, overpriced, shared host.

Proxy the Google Tag Manager Communication Through Redirects?

How about co-opting the GTM code with the previous idea, but replacing the outgoing Google domains with non-existent paths on the same website that redirect the XHR or pixel requests back to Google? Let’s see if that would even work.

First, I’ll whitelist .googletagmanager.com.

Whitelist googletagmanager.com in pfSense

Now, let’s see if GTM loads:

GTM loads whitelisted in pfSense

We find a lot of minified code:

We unminify and inspect it to find the outgoing URLs or domains.

Unminify the JavaScript with a JetBrains IDE

Guess what? All this script did was load another script: analytics.js. Be sure to whitelist .google-analytics.com as well.

Here is a search for the outgoing domains in the script.

Analytics.js domains found in the script

The requests back to Google are often POST requests. Here is an example of a page-load event:
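For illustration, a classic Measurement Protocol page-load hit looks roughly like the following (the tracking ID and parameter values are placeholders; the real payload carries many more fields):

    POST https://www.google-analytics.com/j/collect?v=1&t=pageview&tid=UA-XXXXXXX-1&cid=1234567890.1234567890&dl=https%3A%2F%2Fericdraken.com%2F&dt=Example%20Page
    Content-Type: text/plain;charset=UTF-8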

Problem: Bare IPs with no Host header would work fine against a dedicated server; however, the IPs for Google Analytics are virtual-hosted, so a Host header is needed. Requests to the bare IPs fail silently because Google Tag Manager hides browser errors very well.

Perform DNS Lookups then Replace Domains with IPs?

Let me try an experiment…

Dig of Google Analytics domain

Now, I’m behind pfSense so all DNS requests go through Unbound and I cannot dig @8.8.8.8, for example. I’ll perform a DNS lookup on my phone while specifying DNS servers (like 8.8.8.8 and 1.1.1.1) and reproduce the results below:

  • 142.250.217.78
  • 142.250.69.206
  • 142.250.217.78
  • 142.250.69.206
  • 142.250.81.206
  • 142.251.40.174
  • 142.250.64.110

We see that the CNAME for www.google-analytics.com flattens to an IP that varies across time with a short TTL. It’s conceivable that if a Worker or Lambda could perform the DNS lookup itself and replace domains with IPs, we will sail past regex and blocklists.

Foreshadowing: This plan would have worked up until about 2014 over HTTP. As stated in the conclusions of the previous brainstorming sections, the IPs for Google Analytics are virtual-hosted, so a Host header is needed.

Sub Goal: Obfuscate and proxy Google Analytics events while using the fewest Worker or Lambda invocations possible. With Workers, we get 100,000 invocations per day across all sites. With Lambdas, we are charged on a per-ms basis.

A Strategy to Rewrite Analytics URLs

Let’s brainstorm ways to prevent Google Tag Manager from being blocked.

  • We could search for XMLHttpRequest and hook into those objects.
    • We cannot set a Host header since 2015.
  • We could search for https:// and rewrite it to /gtm/https/. This would make all outgoing requests hit https://ericdraken.com/gtm/… on the same domain. From there, a Worker can proxy the POST to the remaining URL path (converted back to a real URL).
    • Such a proxy Worker will get blasted and the quota will be depleted quickly.
    • Cloudflare Workers allows 100,000 calls per day with 10 ms of CPU per execution.
    • Lambda charges per millisecond of execution.

Let’s pull on one of these threads some more. Given a URL like

https://ericdraken.com/gtm/https/www.google-analytics.com/j/collect/…

The event URL can be reconstructed. Could this still be blocked by regex lists? Maybe. However, it would be easy to make an S3 redirect rule like:
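For example, a sketch of an S3 static-website RoutingRule (the prefix and bucket specifics are illustrative):

    <RoutingRules>
      <RoutingRule>
        <Condition>
          <KeyPrefixEquals>gtm/https/www.google-analytics.com/</KeyPrefixEquals>
        </Condition>
        <Redirect>
          <Protocol>https</Protocol>
          <HostName>www.google-analytics.com</HostName>
          <ReplaceKeyPrefixWith></ReplaceKeyPrefixWith>
        </Redirect>
      </RoutingRule>
    </RoutingRules>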

To avoid regex blocking in the browser, if we hook into the XMLHttpRequest prototype’s open(), we could encode the www.google-analytics.com domain like so:

https://ericdraken.com/gtm/wgac/j/collect/…

A modification to the S3 redirect would then be:
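Roughly, only the Condition prefix changes (again, just a sketch):

    <Condition>
      <KeyPrefixEquals>gtm/wgac/</KeyPrefixEquals>
    </Condition>
    <Redirect>
      <Protocol>https</Protocol>
      <HostName>www.google-analytics.com</HostName>
      <ReplaceKeyPrefixWith></ReplaceKeyPrefixWith>
    </Redirect>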

Problem: S3 responds to the client with a client-side redirect and it will be blocked by regex and DNS blockers. Dang. Free S3 redirects are out.

Server-Side Analytics Proxying with AWS Lambdas

Why aren’t Lambdas my go-to for serverless dynamism? AWS Lambdas are cold-start by default, meaning the first request after a period of inactivity responds slowly. You can pay more for warm starts, but it’s a big jump in price.

But, do we need zippy analytics communication? Suppose we fire an analytics event at a cold-start Lambda and it takes 5 seconds to warm up and process the event; do we care? No. We can fire and forget. In that case, let’s look at AWS Lambda pricing again.

AWS Lambda Pricing

Duration cost depends on the amount of memory you allocate to your function. You can allocate any amount of memory to your function between 128 MB and 10,240 MB, in 1 MB increments. (ref)

We have a choice of ARM or x86 architectures. Honestly, the Lambda pricing is a Gong Show so I will focus on the most significant metric to ballpark the cost of this strategy.

Memory (MB)    Price per 1 ms
128            $0.0000000021

If we use a light language like JavaScript, proxy and ignore the response, and essentially treat analytic events like UDP broadcasts, then we can write a damn-fast Lambda proxy and maybe pay nothing?

Let’s see.

5 ms per request * 1000 events/day * 31 days = 155,000 ms

That’s just under 3 Lambda-minutes. Then, the pricing would be:

155,000 ms * $0.0000000021/ms = $0.0003255

That’s not even a penny. In fact, by reversing the free pricing limit, we get

$0.0049 = $0.0000000021/ms * x ms

where x is 2,333,333 ms, or 39 Lambda-minutes for free. That’s a lot for free!

Ok, let’s write a Lambda to proxy Google Tag Manager events to bypass adblockers.


Exploring Lambdas

I’ll head over to the AWS web console and create a new Lambda with a public HTTPS endpoint.

Create a new Lambda with a public URL

I’ll code this in Python 3.9 on ARM architecture as a PoC. If it is slow, then I can come back and code it in Node (vanilla JavaScript is not available).

I end up with a public Lambda URL like this:

https://e4hjkz6trtubruicifxemin2kq0kdsyi.lambda-url.us-west-2.on.aws/

When I run the Lambda with the default hello-world Python 3 script, I get:

Test results

Nice. Now, we have a public Lambda that anyone can hit that runs some newer Python. It’s up to us to protect the Lambda with referrer checks and origin IP checks with some creativity. We can also see that I was just billed for 1 ms of Lambda time. Here is what we know so far:

  • 36 MB of RAM was used to print “Hello from Lambda!”
  • 100 ms was spent on warm-up (not billed).
  • Execution took 0.95 ms.
  • Billing was 1 ms.

Let me run this again and see what billing looks like:

Findings: Even though the duration was 1.03 ms, I was billed for 2 ms.
Problem: A Lambda at a public URL – no matter how cryptic – is a single point of failure: Google Analytics could be adblocked simply by adding that Lambda endpoint to someone’s adblock list. Unlikely, but possible.
Problem: A Lambda proxy lives on a different domain and is therefore subject to cross-origin (CORS) issues.

There are big drawbacks with using Lambdas for hiding Google Analytics.


Google Analytics Proxying with a Cloudflare Worker

For this proof-of-concept, I’ll write some Worker code in vanilla JavaScript. It is nice to have the breathing room of the 10 ms of CPU time Cloudflare gives warm-start Workers.

Foreshadowing: It was only later that I found out that the Host header cannot be set in XHR requests since about Firefox 43 (2015), so performing DNS lookups to replace a blocked domain with its IPv4 address is a blind alley. I leave this section here because it is still neat. Feel free to skip it, but all failures are life lessons.

Programmatic DNS Lookups

Here is a simple DNS lookup script.
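As a sketch, such a Worker can call the 1.1.1.1 DNS-over-HTTPS JSON API and time the call (the hostname and response handling below are illustrative):

    // Resolve a hostname via the 1.1.1.1 DoH JSON API and time the call.
    async function resolveA(hostname) {
      const start = Date.now();
      const res = await fetch(`https://1.1.1.1/dns-query?name=${hostname}&type=A`, {
        headers: { accept: 'application/dns-json' },
      });
      const json = await res.json();
      const ips = (json.Answer || [])
        .filter((rec) => rec.type === 1) // type 1 = A record
        .map((rec) => rec.data);
      return { ips, ms: Date.now() - start };
    }

    addEventListener('fetch', (event) => {
      event.respondWith(
        resolveA('www.google-analytics.com').then(
          (result) => new Response(JSON.stringify(result), {
            headers: { 'content-type': 'application/json' },
          })
        )
      );
    });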

We see it calls the 1.1.1.1 DoH DNS lookup endpoint and returns some JSON information containing the A records we seek.

How long did this take?

This doesn’t tell me much, so I modified the Worker to add timing information which I then ran a few times:

Two DNS lookups with timing information

Performing two DNS lookups is fairly quick. Next, we have to get the Google Tag Manager code and Analytics code.

Debugging: Cloudflare makes it easy to debug Worker scripts. Take this gaffe, for example:

Cloudflare Worker debugging and trace


Phase One: Hijack Google Tag Manager

Before getting into a working script, here is the idea for co-opting Google Tag Manager:

Replace blocked domains with their most recent DNS A or flat-CNAME IP

Again, IP-replacement is not a viable solution. What is valuable is the insight into the timings of portions of Worker scripts to see how we can stay under 10ms.

The most expensive operation so far is actually getting the source of the GTM script.

Good News: We can ask Cloudflare to cache the fetch of GTM. Please note that your site’s GTM ID is hard-coded in the gtm.js.
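A minimal sketch of the cached fetch inside the Worker (the container ID and TTL are placeholders; cf is a Cloudflare-specific fetch option):

    // Fetch gtm.js once and let Cloudflare's edge cache it for subsequent requests.
    async function fetchGtmSource() {
      const res = await fetch('https://www.googletagmanager.com/gtm.js?id=GTM-XXXXXXX', {
        cf: { cacheEverything: true, cacheTtl: 3600 },
      });
      return res.text();
    }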

Now look at the timings:

Before continuing, let’s see if I can load GTM with my customizations from my own domain as a checkpoint.

Load Google Tag Manager from a custom domain

Excellent.

Gotcha: Inline Script Tags

Nice try, Eric. But just like shared servers, a request sent to a virtual-hosted IP must include the hostname or else the default virtual host may not be the intended service.

Failed Google IP replacement

What happened here is that Google Tag Manager created a <script> tag and set the src to https://142.250.191.46/analytics.js which did not hit the service that provides the analytics script.

What can we do?

Idea: We can intercept and fetch the <script> src contents in the Worker and then load and eval() them in plaintext in the DOM.
Eval: Why am I willing to run eval’d Google Analytics code in my sites? My sites are serverless and not prone to XSS vulnerabilities.
Serve Local Analytics.JS: Why not copy and serve a local copy of analytics.js? The canonical resource has a max-age of 7200 which means it is cached for 2 hours; we cannot be certain it doesn’t change often.

Phase Two: Hijack Google Analytics Inline Scripts

With our interesting problem at hand, we need to co-opt both the inline script mechanism and XMLHttpRequest to rewrite some offending URLs.

This is the desired XMLHttpRequest hijacking which is as easy as paint-by-numbers by replacing the prototype.open:

Intercepting Google Analytics requests
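In spirit, the hook looks something like this (encodeBlockedUrl() is a stand-in for whatever rewrite logic your Worker expects; it is not a real GTM function):

    // Keep a reference to the native open() and wrap it exactly once.
    const nativeOpen = XMLHttpRequest.prototype.open;
    XMLHttpRequest.prototype.open = function (method, url, ...rest) {
      // Map blocked Google domains/paths to same-domain, encoded paths.
      return nativeOpen.call(this, method, encodeBlockedUrl(url), ...rest);
    };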

Rewrite Blocked Domains to Obfuscated Strings

Here is an idea of the concept:

Why this odd rewrite flow?

A script with the name analytics.js may be blocked by regex just like the infamous fingerprint2.js is regex-blocked. The same goes for the www.google-analytics.com string in case the blocking regex is loosely written. Those telltale strings are base64-encoded. A known delimiter such as /js/ separates or prefixes the encoded strings. Any intermediary path segments and the query string sail through in the clear.

We can do even better.


Rewrite Blockable Strings to Encoded Strings

Here is the rewrite logic that I actually use in production:

To decode and fetch the URLs safely, the Worker listens on /a/* for JavaScript source files denoted by /a/js/. From there, every path segment is atob()-decoded. The result may be garbage if the segment is not base64-encoded, so we check the first character for the telltale, in this case a single !.
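A sketch of both sides, following the description above (the helper names are made up; the ! sentinel and /a/js/ prefix match the scheme described):

    // Client side (inside the rewritten GTM/analytics source): telltale strings
    // become base64 path segments, each prefixed with '!' before encoding.
    const domainSeg = btoa('!www.google-analytics.com');
    const fileSeg = btoa('!/analytics.js'); // note the leading slash (see the gtm.js gotcha below)
    const url = `https://ericdraken.com/a/js/${domainSeg}/${fileSeg}`;

    // Worker side: atob() every path segment; only segments that decode cleanly
    // AND start with '!' are treated as encoded secrets.
    function decodeSegment(segment) {
      try {
        const plain = atob(segment);
        return plain.startsWith('!') ? plain.slice(1) : segment;
      } catch (e) {
        return segment; // not base64 - pass it through untouched
      }
    }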

The plan is to keep a list of known blockwords like analytics.js and gtm.js and replace them in the GTM code. The same goes for blocked domains. Let’s see it in action:

Better solution to encode blocked URL patterns
Google Analytics loading with encoded URLs

As we can see, gtag and analytics.js load via inline scripting. XHR requests are also being proxied. However, as a hint of things to come, it will be nice to tell the difference between inline <script src=..> requests, and XHR requests on page events.

Regex Blocking: To convince you that regex blocking is happening, consider this network request for gtm.js in an encoded URL. Notice how it is still blocked.

Notice how regex blockers target gtm.js

Gtm.js: Do not encode the string “gtm.js”; instead, encode “/gtm.js” (notice the leading slash). The reason is that “gtm.js” is also the name of an event – the GTM-loaded event – while “/gtm.js” is the script name we want to hide.

Phase Three: Intercept Google Analytics XMLHttpRequests

Since 2015, XMLHttpRequest has been hardened against executing XSS exploits by forbidding certain header overrides.

Just for Completeness: This is what my Analytics request interceptor would do if it could set the Host header: it replaces certain domains with their recent DNS lookup IPs and “sets” the Host header.
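Roughly, the dead end looks like this (the IP is just one of the dig results above); Host is a forbidden header name, so browsers ignore the call – Chrome merely logs a warning:

    const xhr = new XMLHttpRequest();
    xhr.open('POST', 'https://142.250.217.78/j/collect'); // a recently looked-up Google IP
    // Forbidden header: silently ignored by the browser - and even if it weren't,
    // the TLS certificate has no SAN entry for a bare IP address.
    xhr.setRequestHeader('Host', 'www.google-analytics.com');
    xhr.send();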

Additionally, to make this work, I’d have to downgrade requests to HTTP because there is no SAN entry for an IP address.

In case you are wondering, no, we cannot ask XHR requests to go through an actual, commercial proxy (one with a password, etc.).

Let’s turn back to our old standby for bypassing CORS errors and proxying requests, next.


Phase Three Redux: Proxy XMLHttpRequests

What will work is proxying every Analytics event through a PHP script (Apache server) or ASPX script (IIS) for free. My websites are serverless, so server scripts cost a fee per invocation.
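On Cloudflare, the same fire-and-forget idea in a Worker looks roughly like this (decodePath() stands in for the segment decoding sketched earlier; input validation is stripped out):

    addEventListener('fetch', (event) => {
      event.respondWith(proxyHit(event.request));
    });

    // Rebuild the real Google Analytics URL from the decoded path and query,
    // forward the hit, and answer with a tiny 204 - fire and forget.
    async function proxyHit(request) {
      const incoming = new URL(request.url);
      const target =
        'https://www.google-analytics.com' + decodePath(incoming.pathname) + incoming.search;
      const upstream = await fetch(target, {
        method: request.method,
        body: request.method === 'POST' ? await request.text() : undefined,
      });
      return new Response(null, { status: upstream.ok ? 204 : 502 });
    }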

Here is what the results of such proxying look like:

Google Analytics server-side proxying

Wonderful.


Gotcha: Google Tag Assistant’s Preview Mode

When you preview Tags, you enter your website with extra flags and side-channel communications; this can play havoc with our proxy setup:

  • For one, GTM needs to add gtm_debug= to the page URL and also the inline GTM script.
Google Tag Assistant modifies your page in Preview Mode
  • For another, GTM in Preview Mode will override the dataLayer window object and also change the web property ID on you (for debugging).
  • Still another, Preview Mode may inject gtag.js or gtm.js multiple times, so we must place a guard around the XMLHttpRequest hijacker so it only installs itself once (see the small guard sketch after this list).

  • Yet another, in Preview Mode, we may have to proxy small images and CSS: we cannot take a guess at the content type but should be as robust as possible when proxying responses.

  • Penultimately, when co-opting gzip- or Brotli-compressed scripts, we must be sure to remove the compression-related headers because the proxied response to the client will be uncompressed. Overlooking this can produce strange characters and side effects that are hard to debug. Trust me.

  • Finally, care must be taken to reject random or bad input and to stop hackers from using your Worker as a proxy service for bots. Also, reject URLs with thousands of characters that could chew through your Worker quota.

Try to prevent huge or bad input
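As referenced in the list above, a minimal once-only guard can be as small as this (the flag and helper names are arbitrary):

    // Install the hooks only once, no matter how often GTM re-injects itself.
    if (!window.__gaHooksInstalled) {
      window.__gaHooksInstalled = true;
      hookXmlHttpRequestOpen(); // the XMLHttpRequest wrapper sketched earlier
    }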

Gotcha: Web Application Firewall (WAF) Blocks Long Encoded Strings

And you thought this was going to be easy. A WAF will drop requests with long base64-encoded strings because such URLs are telltale signs of an encoded web shell being operated, or of someone trying to egress company information. So, how do we bypass both adblockers and the WAF? Is this not an entertaining problem? This gotcha and the next will be overcome shortly.


Gotcha: Dynamic URLs and Navigator.sendBeacon()

Consider the following code snippet:

Piecemeal URLs that cannot be grepped

We already rewrite the Google Analytics source to encode inline script tags, iframes, and tracking pixels. We also trap classic XHR requests. We’ll need to expand the XHR interceptor to rewrite some dynamic URLs on the fly (e.g. "https://" + a + "google-analytics.com").

New XHR: There is a newer, built-in, asynchronous way to send analytics data to a server: Navigator.sendBeacon(). We must trap it as well as XMLHttpRequest to encode Google URLs and blockwords on the fly.
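Trapping it is a one-liner in the same spirit as the XHR hook (again, encodeBlockedUrl() is a placeholder for the rewrite logic):

    // Rewrite the beacon URL before handing off to the native implementation.
    const nativeSendBeacon = navigator.sendBeacon.bind(navigator);
    navigator.sendBeacon = (url, data) => nativeSendBeacon(encodeBlockedUrl(url), data);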

Or do we?

Should we play whack-a-mole with emerging browser technologies (e.g. sendBeacon), or be future-proof and permanently bypass adblockers?


Future-Proof URL Encoding, or Fantabulous Shenanigans

Let’s use random words instead of base64 strings and bypass the WAF: the Web Application Firewall keys off large base64 strings when deciding whether to drop a connection. Here is what using random words looks like:

Unusual words as encodings instead of base64 strings
Random words instead of telltale strings

If you are zany like me, you can make the random words polymorphic so they do not repeat for a long time; use Cloudflare KV or an S3 bucket of words in “folders” denoted by the day of the month (1~31). Good luck, regex-based adblockers.
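A sketch of the day-of-month rotation (WORDS and BLOCKWORDS are illustrative; in practice the word lists would live in Cloudflare KV or S3):

    // Pick a code word per blockword per day, so the proxied URLs keep changing.
    const BLOCKWORDS = ['/gtm.js', 'analytics.js', 'www.google-analytics.com'];
    function codeWordFor(blockword, day = new Date().getUTCDate()) {
      const words = WORDS[day]; // e.g. 31 small word lists keyed by day of month
      return words[BLOCKWORDS.indexOf(blockword)]; // same position -> same word all day
    }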

Example URLs:

https://ericdraken.com/a/tribblekittyhawk/g/collect?v=…
https://ericdraken.com/e/fantabulous/shenanigans?v=…
https://ericdraken.com/z/fantabulous/kittyhawk

Here are similar, polymorphic URLs captured in DevTools:

Polymorphic anti-adblock URLs

Demonstration

Here are some screenshots demonstrating this technique of preventing adblockers and DNS blockers from blocking Google Analytics.

GTM and GA4 are working through my Worker proxy
We’re doing very well on CPU time
New GA4 visitor interactions graphs
More GA4 visitor interactions with adblocking bypassed

Bonus: Track Visitors Who Use Adblockers

Allow me to demonstrate my technique of un-adblocking and un-DNS-blocking Google Analytics by testing whether the visitor has an adblocker, and seeing whether that person interacts with my site regardless. First, what is a telltale sign of an ad? Let’s check out the blocklist in uBlock Origin:

Any of these ids will trigger an adblock event

Then, add some quick JavaScript to send an event when a RegEx adblocker is detected.
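The detection can be as simple as a bait element styled like an ad that cosmetic filters will hide (the id, class names, and event name below are illustrative):

    // Insert a bait element whose id/class appear on common filter lists.
    const bait = document.createElement('div');
    bait.id = 'ad-banner';
    bait.className = 'adsbox ad-placement';
    bait.style.cssText = 'position:absolute;top:-999px;height:10px;width:10px;';
    document.body.appendChild(bait);

    // If a cosmetic filter hid or removed it, report an adblock event to GA4.
    setTimeout(() => {
      const blocked = !document.body.contains(bait) || bait.offsetHeight === 0;
      if (blocked && typeof gtag === 'function') {
        gtag('event', 'adblock_detected', { method: 'bait-element' });
      }
      bait.remove();
    }, 500);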

Results and Analytics

The results are exciting:

Users who use adblock and no adblock

Note: The overlap is due to the adblocker-check script not being on 404 pages, yet.

Chrome is by far the leader in web browsers
How people move through my site (since May)

Optimizations

Consider the following network graph:

Some requests can be optimized

You may be happy with 100 ms for loading the gtag.js script on each page view with a 200 OK response instead of a 304 Not Modified, but I’m not. This is an artifact of not having ETags or timestamps on a serverless website using Workers or Lambdas. We could overcome this with some clever use of the Cloudflare KV store, or by caching the Google scripts in S3 (but then we’d have to store S3 credentials in Cloudflare, which I am not prepared to do – total separation of concerns is desired). For now, it’s a splinter in the mind that can be addressed later, but it’s good enough for today.

Conclusion

Proxying isn’t new, but polymorphic URL rewrites and recompiling the Google Analytics scripts make for a cleverer mouse. That’s basically it: we rewrite obvious regex triggers inside Google’s scripts with polymorphic tricks so regex-based adblockers cannot build a list of keywords against us.

Let’s look at this site, for example.

About 40% of my audience uses adblockers

My site is just a fun, niche, nerd-out site, so I bet my visitors are savvy and use adblockers without a second thought – the above Venn diagram is evidence of that.

Success: We’re able to prevent Google Tag Manager and Google Analytics from being blocked by adblockers and DNS blockers by using polymorphic URL-encoding and same-domain proxying. We can get nearly 100% of visitor analytics and stop losing 25~30% of visitor data due to adblockers and DNS blockers.

Update 2022-11-04:

uBlock Origin and EasyList put out some neat filters just for me.

uBlock is getting aggressive

Let’s learn about adblocking and see what is going on here:

  1. They are trapping dataLayer so it blows up when it is read. Clever!
  2. They are removing all async scripts – usually analytics scripts use this.
  3. They made an effort to keep my site working.
  4. They are blocking all scripts and XHR requests – nuking the site by accident.

We had a chat, and the last rule is gone now. They’re pretty good people.

Success: The community came together, took notice, and made custom filters for this site. Nice. I could try and keep going and do more things, but I’ve moved on and am doing other projects now. Much respect to @Yuki2718 for showing me a few things. I hope you enjoyed the read.

Notes:

  1. Fractions of a penny, with a generous per-day free amount