For my machine learning projects, I need data you just can’t buy.
This requires SPA (Single-Page App) web data extraction involving multiple clicks and page scrolling that curl can’t handle. Headless Chrome puppeted over RDP (Chrome’s Remote Debug Protocol, officially the Chrome DevTools Protocol) is a brilliant solution for this. I also need multiple changing IPs to avoid being soft-banned or rate-limited. To that end, I present a weekend Docker project to facilitate this research.
Initially, I tried in vain to set up a network of headless Chrome containers, VPN containers, and proxies that would coordinate and keep session affinity while scaling. It became convoluted quickly. A friend suggested combining the Chrome and VPN components into a single container and scaling those. That turned out to be the winning idea.
- The headless Chrome browser just works.
- Headless-browser detection is avoided.
- Chrome+VPN containers can be scaled.
- If the VPN is down, all traffic is blocked.
- WebRTC is completely blocked (no real IP leaks).
- A single proxy delegates Chrome+VPN connections.
- VPNs can be randomized dynamically to thousands of locations.
- The proxy maintains session affinity.
Here are the essential Docker components.
Hundreds of people have forked a version of OpenVPN for Docker, and one person forked it to work with NordVPN. That was forked again to add randomized servers and cron jobs in an Alpine Linux image. This is the fork I use to work with Chrome, but heavily modified for Ubuntu.
Docker Supervisor System (S6)
S6 seems to work okay. Since this container has a few services running (Chrome, OpenVPN, cron), a supervisor tree is required to restart services or kill the container outright. I designed the container to die if the VPN fails to authenticate (bad username or password), but otherwise restart the individual services.
The supervisor is downloaded from GitHub, not installed via apt, so I also verify the signature of the download in the Dockerfile for safety.
Randomizing VPN Servers
You can pass in a crontab schedule to acquire a new VPN server periodically. For example, setting RECREATE_VPN_CRON=*/10 * * * * will connect to a random VPN server every 10 minutes. NordVPN has thousands of servers, so rotating like this makes a soft ban unlikely (though my use case is harmless).
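To make the schedule concrete, here is a minimal Python sketch of how the minute field in that example expands. The helper `minutes_fired` is a hypothetical name for illustration and only understands `*`, `*/N`, and a single number; real cron accepts more forms, and the container's own handling of RECREATE_VPN_CRON is not shown here.

```python
from typing import List


def minutes_fired(cron_expr: str) -> List[int]:
    """Return the minutes of each hour at which a cron schedule fires.

    Only the minute field is interpreted, and only the forms `*`, `*/N`,
    and a plain number. Lists and ranges are deliberately left out.
    """
    fields = cron_expr.split()
    if len(fields) != 5:
        raise ValueError("expected five cron fields")
    minute = fields[0]
    if minute == "*":
        return list(range(60))
    if minute.startswith("*/"):
        step = int(minute[2:])
        return list(range(0, 60, step))
    return [int(minute)]


print(minutes_fired("*/10 * * * *"))  # [0, 10, 20, 30, 40, 50]
```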
IP Leaks and Browser Fingerprinting
WebRTC is a real-time communication protocol that can reveal your true IP address, even from behind a proxy. A careless VPN+proxy setup may leak your real IP. This solution forces all traffic from the container through the VPN tunnel adapter. If the VPN is down or still negotiating, then traffic is blocked. Linux iptables handles this.
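The kill-switch idea can be sketched as a handful of iptables rules. The Python below only builds and prints the commands (a dry run); it is not the container's actual rule set, and the interface name `tun0`, the server address, and port 1194 over UDP are assumptions for illustration. A real container also needs exceptions for the Docker network that the proxy connects over, which are omitted here.

```python
from typing import List


def kill_switch_rules(tunnel: str, vpn_host: str, vpn_port: int,
                      proto: str = "udp") -> List[str]:
    """Build iptables commands that only let traffic out via the VPN tunnel."""
    return [
        # Default policy: drop every outbound packet not matched below.
        "iptables -P OUTPUT DROP",
        # Loopback stays open so local services can talk to each other.
        "iptables -A OUTPUT -o lo -j ACCEPT",
        # Anything leaving through the tunnel adapter is allowed.
        f"iptables -A OUTPUT -o {tunnel} -j ACCEPT",
        # The encrypted handshake to the VPN server itself must get out.
        f"iptables -A OUTPUT -p {proto} -d {vpn_host} --dport {vpn_port} -j ACCEPT",
    ]


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-range placeholder, not a real server.
    for rule in kill_switch_rules("tun0", "203.0.113.10", 1194):
        print(rule)
```

While the tunnel is down there is no `tun0` interface to match, so only the handshake rule can apply and everything else falls through to the DROP policy.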
Additionally, care must be taken to hide the fact that the browser is headless (a bot) so the site presents the original desktop content. For example, out-of-the-box Browserless/Chrome populates the User-Agent header with “HeadlessChrome”. See below.
The contents of the version.json file give the headlessness away. I’ve mitigated this problem in an init script, though you can also override the User-Agent in Puppeteer code. The modification keeps the real version of Chrome but drops the “Headless” bit.
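As a minimal sketch of that modification, the Python function below rewrites a User-Agent string so the Chrome version survives but the “Headless” token does not. The sample string and its version number are made up for illustration, and `strip_headless` is a hypothetical helper, not the init script itself.

```python
def strip_headless(user_agent: str) -> str:
    """Turn 'HeadlessChrome/<version>' into 'Chrome/<version>'."""
    return user_agent.replace("HeadlessChrome/", "Chrome/")


# Illustrative User-Agent; the version number is a placeholder.
sample = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/80.0.3987.0 Safari/537.36"
)
print(strip_headless(sample))
```

The cleaned string can then be handed to the browser, for example through Chrome's `--user-agent` launch flag or Puppeteer's `page.setUserAgent()`.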
Additionally, a diff of the headers between a desktop browser and the headless Chrome browser shows a few more differences:
See this site for some programmatic anti-fingerprinting tips for your headless Chrome using WebDriver or Selenium. See an example below.
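One possible shape for such an example, as a sketch rather than the linked site's code: the function below assembles Chrome launch flags that mask a few common headless tells. The switches themselves exist in Chrome, but which ones matter for a given site is an assumption, and `stealth_flags` is a hypothetical helper. The resulting list can be passed to whatever launches Chrome, for instance one flag at a time through Selenium's `ChromeOptions.add_argument`.

```python
from typing import List


def stealth_flags(user_agent: str, language: str = "en-US",
                  width: int = 1920, height: int = 1080) -> List[str]:
    """Return Chrome command-line flags that make headless look more like desktop."""
    return [
        "--headless",
        # Replace the default UA, which contains "HeadlessChrome".
        f"--user-agent={user_agent}",
        # Set the browser locale explicitly.
        f"--lang={language}",
        # Headless defaults to a small 800x600 window; use a desktop size.
        f"--window-size={width},{height}",
        # Keep Blink from advertising automation via navigator.webdriver.
        "--disable-blink-features=AutomationControlled",
    ]


if __name__ == "__main__":
    for flag in stealth_flags("Mozilla/5.0 (placeholder desktop UA)"):
        print(flag)
```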
Primary Docker Image
With GitHub CI/CD actions, my Dockerfile is built into an image and published to both GitHub and Docker Hub. Sgama made me aware of this new GitHub feature. In my repo there is a .github/workflows folder with a build script taking advantage of GitHub secrets to hold credentials in encrypted form. As a result, you can pull my image directly from ericdraken/chrome-vpn. Here is a block diagram of the main container.
You may have noticed Privoxy listening on port 3001. This is to allow the use of the VPN directly if needed.
Proxy Frontend, Chrome+VPN Backends
With a modified version of HAProxy from eeacms/haproxy, I’ve set up a Chrome cluster load balancer on default port 3000 that distributes Chrome RDP requests across as many Chrome+VPN instances as desired. NordVPN allows six simultaneous connections (another reason to get NordVPN). You can set the load-balancing algorithm to either round-robin or source-affinity. My use case requires source-affinity and a cron job changing the VPNs every few minutes.
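To illustrate the difference between the two modes, here is a small Python sketch. It is a conceptual model only: HAProxy's own `balance roundrobin` and `balance source` algorithms are implemented differently, and the backend names are hypothetical.

```python
import hashlib
from itertools import cycle
from typing import Iterator, List

BACKENDS = ["chrome-vpn-1", "chrome-vpn-2", "chrome-vpn-3"]  # hypothetical names


def round_robin(backends: List[str]) -> Iterator[str]:
    """Hand out backends in turn, so consecutive requests exit via different VPNs."""
    return cycle(backends)


def source_affinity(client_ip: str, backends: List[str]) -> str:
    """Map a client IP to one fixed backend by hashing the address."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]


if __name__ == "__main__":
    rr = round_robin(BACKENDS)
    print([next(rr) for _ in range(4)])  # wraps around to the first backend
    print(source_affinity("10.0.0.5", BACKENDS))  # same answer on every call
```

Source affinity matters for a multi-step scrape: every click in a session reaches the same browser and therefore the same exit IP, at least until the cron job rotates that container's VPN.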
Simultaneous VPN Connections
Here is a working example of six VPN connections up simultaneously in round-robin mode. Every time I curl ipinfo.io through my HAProxy frontend, I receive a different external IP and hence a different apparent geolocation. This works because the VPN service (NordVPN) allows six simultaneous connections. Each Chrome+VPN container waits a random delay before setting up its link, so the containers don’t flood the auth handshake all at once.
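The random-delay idea is simple to sketch in Python. The 30-second ceiling and the function names are assumptions for illustration; the containers' actual mechanism is not shown here.

```python
import random
import time


def startup_delay(max_seconds: float, rng: random.Random) -> float:
    """Pick a uniformly random wait between 0 and max_seconds."""
    return rng.uniform(0.0, max_seconds)


def connect_with_jitter(connect, max_seconds: float = 30.0,
                        rng: random.Random = None, sleep=time.sleep):
    """Sleep for a random interval, then call connect() and return its result."""
    rng = rng or random.Random()
    sleep(startup_delay(max_seconds, rng))
    return connect()
```

With six containers starting together, each one picks its own wait, so the authentication attempts are spread over the window instead of arriving in the same instant.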
From a clone of my repo, copy the .env, populate it, and run the following commands to recreate my example.
docker-compose -f docker-compose-scale.yaml up --scale chrome-vpn=6
curl -x localhost:3001 ipinfo.io
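The same check can be scripted. The Python sketch below requests ipinfo.io several times through the proxy address from the curl command above and collects the distinct exit IPs. The `/ip` path on ipinfo.io returns the caller's address as plain text; the attempt count and the helper names are assumptions for illustration.

```python
import urllib.request
from typing import Callable, Set

PROXY = "http://localhost:3001"  # same proxy address as the curl command


def fetch_ip_via_proxy(proxy: str = PROXY, timeout: float = 15.0) -> str:
    """Ask ipinfo.io which IP it sees when the request goes through the proxy."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy})
    )
    with opener.open("http://ipinfo.io/ip", timeout=timeout) as response:
        return response.read().decode("ascii").strip()


def distinct_ips(fetch: Callable[[], str], attempts: int) -> Set[str]:
    """Call fetch() repeatedly and return the set of distinct IPs seen."""
    return {fetch() for _ in range(attempts)}


# With the cluster running:
#   print(distinct_ips(fetch_ip_via_proxy, attempts=12))
```

In round-robin mode the set should grow toward the number of running containers; in source-affinity mode it should stay at one IP until the VPN is rotated.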
My use case is specific and innocent enough, but feel free to fork my repo and make improvements if it can benefit your projects.