Cloudflare blocking my script as a bot

Does anyone have suggestions for checking for broken links on 3rd party web sites using Robot Framework? Cloudflare is protecting the target sites from “bots”.

My web site has several dozen links on it to third party web sites. A few times each year a third party page changes or vanishes and I have a broken link to fix. I don’t want to keep doing this manually, so I had the idea of writing a Robot Framework script to visit each site and detect broken links. I was thinking of running it about once a month. My plan was to use the Browser library Click keyword and then inspect an item or two on the third party web page. When I go to one site, Cloudflare detects that a bot is running against the site and blocks access. Totally reasonable behaviour on Cloudflare’s part. Any suggestions? E.g. I am running the browser headless; would it help to run it with a real, visible browser on my computer?

Hi @northernHemisphere,

I don’t think there’s going to be a good answer. I’ve run into this type of problem myself and have not found a good solution.

  • One way you can get around it is to automate the browser with something like the Sikuli or ImageHorizon libraries. The problem is that they work on image recognition, so identifying the links will be tricky unless you took screenshots of them earlier (not a great solution).
  • What you are doing is reasonable, you are automating site maintenance, so you could try contacting Cloudflare and asking for their recommendation. But if you are not a Cloudflare customer yourself, and it’s merely your affiliate that’s a Cloudflare customer, they might not be particularly helpful.

You are running a real browser on your computer; a headless browser is still a real browser.
From my understanding, the difference is that Browser Library / SeleniumLibrary run the browser in a “clean” session without much history, cookies, etc., and that’s part of the trigger. I was able to trigger Cloudflare’s bot protection by using Incognito mode on a newly built machine with no automation tools.

  • Another option would be to detect the Cloudflare page and then just do a DNS lookup of the domain for the link; if the domain resolves, mark it as a pass (also not a great solution, because if a page on their site changes you’ll still miss it).
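
A rough sketch of that DNS fallback in Python, which Robot Framework could import as a library file. The function name `domain_resolves` and its `resolver` parameter are made up for illustration, and it only tells you the domain still exists, not that the page does:

```python
import socket
from urllib.parse import urlsplit


def domain_resolves(url, resolver=socket.getaddrinfo):
    """Return True if the host part of `url` resolves in DNS.

    `resolver` is a hypothetical hook so the lookup can be swapped out;
    by default it is the standard library's socket.getaddrinfo.
    """
    host = urlsplit(url).hostname
    if not host:
        # No host could be parsed from the link, so there is nothing to look up.
        return False
    try:
        resolver(host, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the name did not resolve; UnicodeError: malformed host name.
        return False
    return True
```

A pass here is weak evidence only: a domain that resolves can still serve a 404 for the specific page you link to.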

Sorry I don’t have any good answers, hopefully this was a little helpful.

Dave
(from the Southern Hemisphere :wink: )

Thanks for taking time to respond. I agree that there is no good answer.

I did some googling and there are various articles out there about things to try. But working hard to find some trick is fruitless, because it will eventually be adopted by bad bots and then detected and blocked by Cloudflare.

My solution is to automate what I can and have the script spit out a list of 3rd party sites to manually test for broken links.
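
Roughly what I have in mind for the output step, as a Python sketch. The name `build_report` and the outcome labels "ok", "broken" and "blocked" are my own placeholders; how each outcome gets decided is left to whatever check runs before it:

```python
def build_report(results):
    """Turn {url: outcome} into a text report.

    Outcome "ok" needs no action and "broken" needs a fix. Anything else
    (for example "blocked") could not be decided automatically and goes
    on the manual list.
    """
    broken = sorted(url for url, outcome in results.items() if outcome == "broken")
    manual = sorted(
        url for url, outcome in results.items() if outcome not in ("ok", "broken")
    )

    lines = ["Broken links to fix:"]
    lines += [f"  {url}" for url in broken] or ["  (none)"]
    lines.append("Check manually:")
    lines += [f"  {url}" for url in manual] or ["  (none)"]
    return "\n".join(lines)
```

Links that pass are left out of the report, so the monthly output only contains work to do.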

What if, instead of trying to visit the linked page with a browser, you do a GET request on that URL? Does Cloudflare also block that request and, more importantly, if blocked, does Cloudflare respond the same way for existing and removed websites? Even if GET requests on available sites are blocked, getting a 404 (instead of a “blocked” response) would tell you enough.
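
A minimal sketch of that experiment in Python, using only the standard library. The function names and the User-Agent string are assumptions of mine; the idea is to fetch one page you know exists and one you know does not, and see whether the status codes differ:

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code of a GET on `url`, or None if unreachable."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "link-checker-sketch"}  # placeholder value
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses arrive as exceptions; the code is what we want.
        return error.code
    except OSError:
        # DNS failure, refused connection, timeout and similar.
        return None


def statuses_distinguish(status_existing, status_missing):
    """True if a GET tells an existing page apart from a removed one."""
    return status_existing != status_missing


# Usage (hypothetical URLs on the same third party site):
# existing = fetch_status("https://third-party.example/known-page")
# missing = fetch_status("https://third-party.example/no-such-page")
# print(existing, missing, statuses_distinguish(existing, missing))
```

If both calls come back with the same status, the GET approach cannot tell a live page from a removed one on that site.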

Not a bad idea. I tried it out but it doesn’t help.

Unfortunately Cloudflare returns 403 responses for both existing and non-existent pages on the site.

Unfortunate, but not entirely unexpected I guess. After all, parties like Cloudflare get paid to block automated / scripted access, so you’ll have a hard time working around that.
