WGet or cURL: Mirror Site from http://site.com And No Internal Access

ajstein asked:

I have tried wget -m, wget -r, and a whole bunch of variations. I am getting some of the images on http://site.com, one of the scripts, and none of the CSS, even with the fscking -p parameter. The only HTML page is index.html and there are several more referenced, so I am at a loss. curlmirror.pl on the cURL developers website does not seem to get the job done either. Is there something I am missing? I have tried different levels of recursion with only this URL, but I get the feeling I am missing something. Long story short, some school allows its students to submit web projects, but they want to know how they can collect everything for the instructor who will grade it, instead of him going to all the externally hosted sites.

UPDATE: I think I figured out the issue. I thought the links to the other pages were in the index.html page that downloaded. I was way off. Turns out the footer of the page, which has all the navigation links, is handled by a JavaScript file, Include.js, which reads JLSSiteMap.js and some other JS files to do page navigation and the like. As a result, wget does not pick up any other dependencies, because a lot of this crap is not handled in the web pages themselves. How can I handle such a website? This is one of several problem cases. I assume little can be done if wget cannot parse JavaScript.

My answer:


Unfortunately wget cannot parse JavaScript, so spidering such a site is quite difficult.

The good news is that search engines don’t generally parse it either, so the site is most likely serving them slightly different content (which is a bad idea for other reasons) in order to get its pages indexed. If the site wants to be indexed at all, it has to give search engines pages that are reachable without JavaScript. If this is the case, you can work around it by spoofing Googlebot with wget, such as:

wget --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ...

A few sites will actually check your IP address to verify that you’re really Googlebot, but this is far less common than it should be.
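
As a rough sketch, the Googlebot spoofing described above could be combined with wget's usual mirroring options along these lines. The helper name mirror_as_googlebot is hypothetical, the particular set of flags is an assumption about what a grading archive would need, and --adjust-extension assumes a reasonably recent wget (older versions called it --html-extension). It only helps if the site really does serve crawlable pages to Googlebot.

```shell
#!/bin/sh
# User-agent string taken from the answer above.
GOOGLEBOT_UA="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# mirror_as_googlebot is a hypothetical helper name, not part of wget.
mirror_as_googlebot() {
    # --mirror            recursive download with timestamping, unlimited depth
    # --page-requisites   also fetch the CSS, images and scripts each page needs
    # --convert-links     rewrite links so the local copy works offline
    # --adjust-extension  save HTML/CSS files with matching extensions
    # --no-parent         never ascend above the starting directory
    wget --mirror --page-requisites --convert-links --adjust-extension \
         --no-parent --user-agent="$GOOGLEBOT_UA" "$1"
}

# Example call (placeholder URL from the question):
# mirror_as_googlebot "http://site.com/"
```

Wrapping the command in a function keeps the long user-agent string in one place when several student sites have to be collected.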

Another thing to do is to check for the presence of a /sitemap.xml file and use it as a list of URLs to crawl. Some sites provide this file for Google and other search engines to use to spider their content, but nothing says you can’t also use it…
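
A minimal sketch of the sitemap approach, assuming the site publishes a standard sitemap with each URL on a single line inside a <loc> element. The function name extract_sitemap_urls and the file name urls.txt are made up for illustration. The pattern matching is deliberately naive: it does not decode XML entities such as &amp; and does not follow sitemap index files.

```shell
#!/bin/sh
# extract_sitemap_urls is a hypothetical helper: it reads sitemap XML on
# stdin and prints the contents of every <loc> element, one URL per line.
extract_sitemap_urls() {
    grep -o '<loc>[^<]*</loc>' | sed -e 's/<loc>//' -e 's/<\/loc>//'
}

# Example usage (placeholder URL from the question, left commented out so
# that nothing is fetched until you run it deliberately):
#   wget -q -O - "http://site.com/sitemap.xml" | extract_sitemap_urls > urls.txt
#   wget --page-requisites --convert-links --input-file=urls.txt
```

Feeding the list to wget with --input-file fetches each listed page directly, so it does not depend on wget discovering links that only exist in JavaScript.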


View the full question and answer on Server Fault.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.