It was a simple job: creating a website using the tools and free hosting on Google Sites. The trouble is, there's no site backup tool, and I don't like leaving the only copy of my precious pages sitting in 'the cloud'. Now imagine you're taking over a website project, or migrating to a new ISP or server; maybe you need to manage rising web traffic by creating a 'mirror' of your main site. Maybe you're going out and about with no web connection, but need to take some content with you. All the tools I know of are commercial, 'industrial strength' or non-Linux. Which is where the WebHTTrack utility comes in...
WebHTTrack is an 'offline browser', allowing you to download a website from the Internet to a local folder, complete with all its sub-folders, images and other files. The program is free under the terms of the GNU General Public License. I was aware of the original command-line program HTTrack, but was pleasantly surprised to find the GUI version directly available from the Ubuntu repositories. You can install it from Synaptic (just search for 'webhttrack') or get the .deb package file from the project website at http://www.httrack.com. Once installed, you should get a menu entry for WebHTTrack Website Copier. The launcher this creates will fire up WebHTTrack in a web browser. It's not the prettiest set of forms you've ever seen, but it does provide a comprehensive step-by-step web interface for a very powerful tool. You define every site download as a project, from which WebHTTrack can later update existing mirrored sites and resume interrupted downloads.
At its most basic, you need only specify the website address and a destination folder and WebHTTrack does the rest for you, with the progress window counting through the files and folders downloaded. The program trawls the pages and all the links it finds in them, compiling a list of contents and folders for download. When it's done, you can navigate to your download folder, open a page of the mirrored website in your browser and browse the site from link to link as if you were viewing it online. In theory it is possible to dummy-run the tool for an estimate of the files and storage required, but don't rely on this for dynamically generated sites.
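For readers who would rather script the job, the same basic recipe (one address, one destination folder) can be sketched around the original command-line httrack program. This is only a minimal sketch, not the project's recommended invocation: it assumes the `httrack` binary is installed and on your PATH, the function names are my own, and the URL and folder in the usage note are placeholders. The `-O` option sets the output folder.

```python
import os
import shlex
import subprocess


def build_mirror_command(url, destination):
    """Build the argument list for a basic httrack mirror.

    Assumption: the command-line 'httrack' binary is installed.
    '-O' tells httrack where to write the mirrored project.
    """
    dest = os.path.expanduser(destination)  # allow '~/mirrors/mysite'
    return ["httrack", url, "-O", dest]


def run_mirror(url, destination):
    """Run the mirror and return httrack's exit code (hypothetical helper)."""
    cmd = build_mirror_command(url, destination)
    print("Running:", shlex.join(cmd))
    # No shell is involved, so the URL needs no extra quoting.
    return subprocess.run(cmd, check=False).returncode
```

Calling `run_mirror("https://example.com/", "~/mirrors/example")` would start the download; `build_mirror_command` on its own just shows what would be run, which is handy for checking before you commit any bandwidth.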
This works very well for simple sites and it achieved my objective of backing up my Google Sites pages and my company site. Offline browsing works very well for these.
The downside? Offline browsing will work fine for flat HTML, but you will need the appropriate web-server technologies installed for scripting, Java, PHP or other server-side features to work.
More importantly, this extremely useful program comes with moral and technical hazards, and it's worth re-emphasising the sound moral guidance of the WebHTTrack team themselves.
'Just because you can, doesn't mean you should...'
- Do not steal private information
  - Do not grab emails
  - Do not grab private information
- Ensure that you can copy the website:
  - Are the pages copyrighted?
  - Can you copy them only for private purposes?
- Do not make online mirrors unless you are authorized to do so
Also, don't think this program is a license for bandwidth abuse and other bad behaviours:
- Do not overload websites you are copying (consider the moral hazards, above); downloading a site can exceed its bandwidth limit, particularly if you capture too many dynamically generated pages. On the flip side, save yourself the embarrassment of trashing your own bandwidth and local storage! Therefore:
- Do not download large websites in their entirety: use filters
- Do not use too many simultaneous connections
- Set limits in the program: bandwidth limits, connection limits, size limits, time limits
- Only disable robots.txt rules with great care
- Try not to download sites during working hours
- Check your mirror transfer rate/size
- For large mirrors, first ask the webmaster of the site if that's not you!
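The limits in the list above can also be set when driving the command-line httrack program directly. The sketch below is a hedged illustration only: the function name and the default values are my own choices, and the option letters are as I understand them from the httrack manual (`-c` simultaneous connections, `-A` maximum transfer rate in bytes per second, `-M` maximum overall size in bytes, `-E` maximum time in seconds), so check `httrack --help` on your own system before relying on them. Filters are passed as extra patterns, with a leading `-` to exclude and `+` to include.

```python
def build_polite_command(url, destination, *, connections=2,
                         max_rate_bytes=25_000,
                         max_size_bytes=100_000_000,
                         max_seconds=3600, filters=()):
    """Build an httrack argument list with conservative limits.

    All defaults are illustrative assumptions, not project recommendations.
    """
    cmd = [
        "httrack", url,
        "-O", destination,        # output folder
        f"-c{connections}",       # simultaneous connections
        f"-A{max_rate_bytes}",    # bandwidth cap, bytes per second
        f"-M{max_size_bytes}",    # stop after this many bytes overall
        f"-E{max_seconds}",       # stop after this many seconds
    ]
    cmd.extend(filters)           # e.g. "-*.zip" to skip archives
    return cmd
```

For example, `build_polite_command("https://example.com/", "/tmp/mirror", filters=("-*.zip",))` gives a command that skips ZIP archives, uses two connections and stops after an hour or roughly 100 MB, whichever comes first. The robots.txt behaviour is deliberately left at the program's default.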
A really powerful and effective little program that suits a number of tasks. You really don't want to run this 'blind'; websites use many platforms and technologies, so save yourself frustration and potential embarrassment: check the user guide on the project website, as there are lots of settings you may need to understand before you run this. RC