Mirroring a website with multiple connections in parallel
c/dokk, posted by zPlus (4 votes, 4 comments)

I have the dokk webserver locally, and from it I generate a static website that I then upload to the website. My goal here is to have the website serve only static files.

The way I generate the static website is by “mirroring” it with wget: wget --mirror --page-requisites --adjust-extension --execute robots=off localhost:8080 and then I upload all the .html pages. My only issue with wget is that it downloads one URL at a time, so the job takes between 1 and 2 hours to complete. I looked at aria2 for parallel downloads, but it only takes a list of URLs as input instead of following links. What tools exist for mirroring a website in parallel?

Have you tried out httrack or curl? You should be able to copy over the website in parallel and specify how many threads, iirc. I can dig into some examples if you want some help with that.

I think curl is like aria2 in that it can’t do recursive downloads (it only takes a list of URLs as input). I have tried httrack, which looks perfect on paper, but it’s so slow and I don’t understand why. It won’t download more than 1 or 2 links per second, and I have tried every flag I could find for removing throttling. On the positive side… wget2 as suggested by @jorgesumle seems to work fine!

Why don’t you generate the website on the server? That way you don’t have to upload it later. There is also wget2, which might be better (I don’t know). There is a Debian package called wget2, and the repo is here: https://gitlab.com/gnuwget/wget2

Thank you for suggesting wget2! It looks like it’s exactly what I need! I didn’t know about it, but I’ve just tried it and it works perfectly. I only had to change the command from “wget” to “wget2”, and it recognized all the flags too!
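
For anyone landing here later, this is roughly what the swap looks like as a small shell function. Treat it as a sketch, not the exact command used above: the flags are the ones from the original wget call, while the function name mirror_parallel, the --max-threads option (which, as far as I know, is wget2's knob for the number of concurrent download threads) and the value 8 are my own assumptions.

```shell
#!/bin/sh
# Sketch: mirror a site with wget2 using several parallel connections.
# mirror_parallel is a hypothetical helper name; the thread count of 8
# is an arbitrary example, not a value taken from the thread.

mirror_parallel() {
    site="${1:-localhost:8080}"   # site to mirror (address from the original post)
    threads="${2:-8}"             # assumed number of parallel download threads

    # Same flags as the original wget command, plus --max-threads.
    wget2 --max-threads="$threads" \
          --mirror --page-requisites --adjust-extension \
          --execute robots=off \
          "$site"
}
```

Calling mirror_parallel localhost:8080 8 would start the mirror with up to 8 threads. If the local webserver becomes the bottleneck, lowering the second argument is the first thing to try.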

Why don’t you generate the website on the server?

It’s a little server. It works great for serving files, but doing any computation on it will take 10x the time.