Puppet Several Headless Chrome Instances Behind Different VPNs in Docker, No IP Leaks

Goal: On a single machine, orchestrate half a dozen or so headless Chrome instances, each connected to a random VPN, for global web surfing and content extraction, all from a single Docker Compose file.

For my machine learning projects, I need data you just can’t buy.

This requires extracting data from single-page apps (SPAs), which involves multiple clicks and page scrolling that curl can’t handle. Headless Chrome puppeted over Chrome’s remote debugging protocol (RDP) is a brilliant solution for this. I also need multiple changing IPs to avoid being soft-banned or rate-limited. To that end, I present a weekend Docker project to facilitate this research.


Chrome+VPN

Initially, I tried in vain to set up a network of headless Chrome containers, VPN containers, and proxies that would coordinate and keep session affinity while scaling. It became convoluted quickly. A friend suggested combining the Chrome and VPN components into a single container and scaling those. That was the winning idea.

Project Objectives

  • The headless Chrome browser just works.
  • Avoid detection of a headless browser.
  • Chrome+VPN containers can be scaled.
  • If the VPN is down, all traffic is blocked.
  • WebRTC is completely blocked (no real IP leaks).
  • A single proxy delegates Chrome+VPN connections.
  • VPNs can be randomized dynamically to thousands of locations.
  • The proxy maintains session affinity.

Here are the essential Docker components.

FROM Browserless/Chrome

Browserless.io maintains a paid service for headless Chrome automation. They also open-sourced a well-maintained Ubuntu-based Docker image that just works, fonts and all. I use this as my base image.

Chrome+VPN Node Puppeteer demo on default port 3000

FROM NordVPN

Hundreds of people have forked a version of OpenVPN for Docker, and one person forked it to work with NordVPN. That was forked again to add randomized servers and cron jobs in an Alpine Linux image. This is the fork I use to work with Chrome, but heavily modified for Ubuntu.

I used to use TorGuard, but they only have two Canadian servers and a handful of American servers. I tried to make the NordVPN image work with TorGuard, but it was far more convenient to sign up with NordVPN and its thousands of servers. They gave me an affiliate link, so if you enjoy this article, consider using my link. Thanks!

Docker Supervisor System (S6)

S6 seems to work okay. Since this container has a few services running (Chrome, OpenVPN, cron), a supervisor tree is required to restart services or kill the container outright. I designed the container to die if the VPN fails to authenticate (bad username or password), but otherwise restart the individual services.
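The restart-or-die policy described above can be sketched in a few lines of Python. This is only an illustration of the decision, not my S6 configuration: the real container expresses it in S6's own service scripts. The service names and the `decide_action` helper are hypothetical, and I am assuming the OpenVPN log contains the marker `AUTH_FAILED` when credentials are rejected.

```python
from enum import Enum


class Action(Enum):
    RESTART_SERVICE = "restart_service"
    STOP_CONTAINER = "stop_container"


# Assumption: OpenVPN logs this marker when the server rejects the credentials.
AUTH_FAILED_MARKER = "AUTH_FAILED"


def decide_action(service: str, last_log_lines: list[str]) -> Action:
    """Decide what the supervisor should do after `service` exits."""
    auth_failed = any(AUTH_FAILED_MARKER in line for line in last_log_lines)
    if service == "openvpn" and auth_failed:
        # Bad username or password: retrying cannot help, so stop everything.
        return Action.STOP_CONTAINER
    # Any other exit (Chrome crash, dropped tunnel, cron hiccup) is retried.
    return Action.RESTART_SERVICE
```

The point of the split is that an authentication failure is permanent until a human fixes the credentials, while every other failure is assumed to be transient.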

Docker top command on the Chrome+VPN container

The supervisor is downloaded from GitHub, not installed via apt, so I also verify the signature of the download in the Dockerfile for safety.

Randomizing VPN Servers

You can pass in a crontab schedule to acquire a new VPN server periodically. For example, setting RECREATE_VPN_CRON=*/10 * * * * will connect to a random VPN server every 10 minutes. NordVPN has thousands of servers, so rotating through them makes a soft ban unlikely (though my use case is harmless).
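As a rough sketch of what each cron tick has to do, the Python below checks that a schedule has the five crontab fields and picks a random server that differs from the current one. The helper names and the hostnames in the usage note are hypothetical; the actual image does this in its own scripts.

```python
import random


def cron_fields(expr: str) -> list[str]:
    """Split a crontab schedule into its five fields, or raise ValueError."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 crontab fields, got {len(fields)}")
    return fields


def pick_next_server(servers, current=None, rng=random):
    """Choose a random server, never the one we are already connected to."""
    candidates = [s for s in servers if s != current]
    if not candidates:
        raise ValueError("need at least one server other than the current one")
    return rng.choice(candidates)
```

For example, `pick_next_server(["ca1.vpn.example", "us7.vpn.example", "jp3.vpn.example"], current="ca1.vpn.example")` returns one of the other two hosts.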

IP Leaks and Browser Fingerprinting

WebRTC is a real-time communication protocol that can reveal your true IP address, even behind a proxy. A careless VPN+proxy setup may leak your real IP. This solution forces all traffic from the container through the VPN tunnel adapter. If the VPN is down or still negotiating, then traffic is blocked. Linux iptables handles this.
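To make the kill-switch idea concrete, here is a minimal sketch that only builds the iptables command lines and does not run them. It is an assumption about the general shape of such rules, not a copy of the rules in my image: the `tun0` interface name, port 1194 over UDP, the example addresses, and the `kill_switch_rules` helper are all illustrative.

```python
def kill_switch_rules(vpn_ip, vpn_port=1194, proto="udp",
                      tunnel="tun0", local_subnet=None):
    """Return iptables commands that block all egress except via the tunnel."""
    rules = [
        # Default policy: drop every outgoing packet not explicitly allowed.
        ["iptables", "-P", "OUTPUT", "DROP"],
        # Loopback is needed by local services.
        ["iptables", "-A", "OUTPUT", "-o", "lo", "-j", "ACCEPT"],
        # Anything leaving through the VPN tunnel adapter is fine.
        ["iptables", "-A", "OUTPUT", "-o", tunnel, "-j", "ACCEPT"],
        # The VPN handshake itself must reach the server outside the tunnel.
        ["iptables", "-A", "OUTPUT", "-p", proto, "-d", vpn_ip,
         "--dport", str(vpn_port), "-j", "ACCEPT"],
    ]
    if local_subnet:
        # Optional: let replies reach a front-end proxy on the Docker network.
        rules.append(["iptables", "-A", "OUTPUT", "-d", local_subnet,
                      "-j", "ACCEPT"])
    return rules
```

Calling `kill_switch_rules("203.0.113.10", local_subnet="172.18.0.0/16")` yields five commands. Because the default policy is DROP, a tunnel that is down or still negotiating means nothing leaves the container except the handshake.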

Additionally, care must be taken to hide the fact that the browser is headless (a bot) so the site presents the original desktop content. For example, out-of-the-box Browserless/Chrome populates the User-Agent header with “HeadlessChrome”. See below.

Revealing the bot-nature of the browser

The contents of the version.json file give headlessness away. I’ve mitigated this problem in an init script, but you can also change it in Puppeteer code. This modification keeps the real version of Chrome but loses the “Headless” bit.
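The rewrite itself is a simple string substitution. The sketch below assumes the version payload is JSON with `Browser` and `User-Agent` fields; the `hide_headless` helper and the version numbers in the sample are made up for illustration, and my init script does not necessarily work this way.

```python
import json


def hide_headless(version_json: str) -> dict:
    """Drop the 'Headless' marker but keep the real Chrome version."""
    info = json.loads(version_json)
    for key in ("Browser", "User-Agent"):
        if key in info:
            info[key] = info[key].replace("HeadlessChrome", "Chrome")
    return info


sample = json.dumps({
    "Browser": "HeadlessChrome/79.0.3945.88",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) HeadlessChrome/79.0.3945.88 Safari/537.36",
})
patched = hide_headless(sample)
```

After this, `patched["Browser"]` is `Chrome/79.0.3945.88`, and the User-Agent no longer contains the word "Headless".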

Desktop User-Agent string

Additionally, a diff of the headers between a desktop browser and the headless Chrome browser shows a few more differences:

Diff of headers between Desktop Chrome and headless Chrome

I added a default language header, but simple JavaScript headless browser detection tests show there is more work to be done.
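If you want to repeat the header comparison yourself, a small helper like the one below is enough. It is a sketch with a hypothetical name, `diff_headers`, and the header values in the usage note are placeholders rather than my captured headers.

```python
def diff_headers(desktop: dict, headless: dict) -> dict:
    """Compare two header sets, ignoring the case of header names."""
    d = {k.lower(): v for k, v in desktop.items()}
    h = {k.lower(): v for k, v in headless.items()}
    return {
        # Sent by the desktop browser but absent from the headless one.
        "missing": sorted(d.keys() - h.keys()),
        # Sent only by the headless browser.
        "extra": sorted(h.keys() - d.keys()),
        # Present in both, but with different values.
        "changed": sorted(k for k in d.keys() & h.keys() if d[k] != h[k]),
    }
```

For instance, comparing `{"User-Agent": "Chrome/79", "Accept-Language": "en-US,en;q=0.9"}` against `{"user-agent": "HeadlessChrome/79"}` reports `accept-language` as missing and `user-agent` as changed.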

Tests to detect the headless Chrome browser

See this site for some programmatic anti-fingerprinting tips for your headless Chrome using WebDriver or Selenium. See an example below.

Passing simple headless browser detection tests

Primary Docker Image

With GitHub CI/CD actions, my Dockerfile is built into an image and published to both GitHub and Docker Hub. Sgama made me aware of this new GitHub feature. In my repo there is a .github/workflows folder with a build script taking advantage of GitHub secrets to hold credentials in encrypted form. As a result, you can pull my image directly from ericdraken/chrome-vpn. Here is a block diagram of the main container.

Chrome+VPN Docker container

You may have noticed Privoxy listening on port 3001. This is to allow the use of the VPN directly if needed.

Proxy Frontend, Chrome+VPN Backends

With a modified version of HAProxy from eeacms/haproxy, I’ve set up a Chrome cluster load balancer on default port 3000 to distribute Chrome RDP requests across as many Chrome+VPN instances as desired. NordVPN allows six simultaneous connections (another reason to get NordVPN). You can set the load-balancing algorithm to either round-robin or source affinity. My use case requires source affinity and a cron job changing the VPNs every few minutes.
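The difference between the two balancing modes is easy to see in a toy model. The Python below is not how HAProxy implements them internally; it is a minimal sketch in which round-robin cycles through the backends and source affinity hashes the client IP to a fixed backend. The backend names and both helper functions are hypothetical.

```python
import hashlib
from itertools import cycle


def round_robin(backends):
    """Yield backends in turn, so consecutive requests hit different VPNs."""
    return cycle(backends)


def source_affinity(client_ip: str, backends):
    """Map a client IP to one backend, stable as long as the list is unchanged."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["chrome-vpn-1", "chrome-vpn-2", "chrome-vpn-3"]
rr = round_robin(backends)
first_four = [next(rr) for _ in range(4)]
sticky = source_affinity("198.51.100.7", backends)
```

Here `first_four` wraps around to `chrome-vpn-1` on the fourth request, while `sticky` is the same backend every time that client connects. That stickiness is what keeps a multi-step Chrome session on one browser instance, and the cron job is what changes the IP behind it.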

HAProxy with Chrome+VPN Docker layout

Simultaneous VPN Connections

Here is a working example of six VPN connections up simultaneously in round-robin mode. Every time I curl ipinfo.io through my HAProxy frontend, I receive different external IPs and hence apparent geolocations. This works because the VPN service (NordVPN) allows six simultaneous connections. My containers use random delays when setting up links across Chrome+VPN containers to avoid flooding the auth handshake.
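The random delay is the simplest part. A minimal sketch, assuming a uniform delay up to a hypothetical 30-second ceiling (the actual containers may use a different range and a shell script rather than Python):

```python
import random


def startup_delay(max_seconds: float = 30.0, rng=random) -> float:
    """Return a random wait so containers don't all authenticate at once."""
    if max_seconds < 0:
        raise ValueError("max_seconds must be non-negative")
    return rng.uniform(0.0, max_seconds)
```

Each container would call something like `time.sleep(startup_delay())` before starting OpenVPN, which spreads six handshakes over the window instead of firing them in the same second.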

While in my cloned repo, copy the .env.tmpl to .env, populate, and run the following commands to recreate my example below.

Six simultaneous VPN connections

Source Code

My use case is specific and innocent enough, but feel free to fork my repo and make improvements if it can benefit your projects.

Chrome+VPN GitHub repo
Success: With a single Docker compose file on a single machine I am able to instantiate six headless Chrome instances connected to six random NordVPN servers with a single load balancer to assist my machine learning data-sourcing efforts.