I have tried
wget -r and a whole bunch of variations. I am getting some of the images on http://site.com, one of the scripts, and none of the CSS, even with the fscking
-p parameter. The only HTML page saved is index.html, and several more are referenced from it, so I am at a loss.
curlmirror.pl from the cURL developers' website does not seem to get the job done either. Is there something I am missing? I have tried different levels of recursion with only this URL, but I get the feeling I am missing something. Long story short: a school allows its students to submit web projects, and they want a way to collect everything for the instructor who will grade it, instead of him visiting all the externally hosted sites.
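For reference, a sketch of the flag combination that usually produces a complete static mirror (site.com stands in for the real project URL, as in the question):

```shell
# --mirror implies -r -N -l inf --no-remove-listing; -p (--page-requisites)
# pulls the CSS, JS, and images each page needs; -k (--convert-links)
# rewrites links so the copy browses offline; -E (--adjust-extension)
# saves pages with an .html suffix. Add -H --domains=... if the assets
# live on other hosts, as they do for these student projects.
wget --mirror -p -k -E http://site.com/
```

If the CSS and scripts still don't come down with this, the server may be serving different content based on the User-Agent header.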
Some sites refuse wget's default user agent, so one workaround is to masquerade as Googlebot:
wget --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ...
A very few sites will actually check your IP address to see if you’re really Googlebot, but this is far less common than it should be.
Another thing to do is check for the presence of a
/sitemap.xml file and use it as a list of URLs to crawl. Some sites provide this file so that Google and other search engines can spider their content, but nothing says you can't also use it…
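A sketch of that approach, assuming GNU grep (for the -P Perl-regex option) and the same placeholder host; a sitemap lists its URLs inside <loc> elements:

```shell
# Fetch the sitemap, extract every <loc> URL, and feed the list to
# wget via -i. -p grabs each page's requisites (CSS, JS, images) and
# -k rewrites the links for offline viewing.
wget -qO- http://site.com/sitemap.xml \
  | grep -oP '(?<=<loc>)[^<]+' \
  > urls.txt
wget -p -k -i urls.txt
```

This only covers pages the site chose to list, so it complements rather than replaces a recursive crawl.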
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.