June 15th 21, 07:10 PM, posted to rec.bicycles.tech
Jeff Liebermann
Sheldon Brown site

On Tue, 15 Jun 2021 04:12:29 -0700, sms wrote:

On 6/13/2021 8:35 PM, Jeff Liebermann wrote:
It just finished downloading.
1.95 GBytes, 17,286 files, 4,196 folders.


I ran WinHTTrack as well.

5.56 GB (5,976,820,140 bytes)
54,979 Files, 29,485 Folders

In the log, at the end, it said "Panic: Too many URLs, giving
up..(100000)"


My guess(tm) is that HTTrack tried to download too many offsite links.
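If you want to check that guess, something like this (untested) Python
will tally the hostnames that show up in the hts-log.txt file HTTrack
leaves in the project folder. The path below is just the usual
WinHTTrack default; adjust it to wherever your project lives:

# Rough sketch: count hostnames mentioned in HTTrack's hts-log.txt
# to see which offsite hosts the engine was chasing.
# The log location below is an assumption -- point it at your project dir.
import re
from collections import Counter

LOG = r"C:\My Web Sites\sheldonbrown\hts-log.txt"  # assumed location

hosts = Counter()
with open(LOG, encoding="utf-8", errors="replace") as f:
    for line in f:
        for m in re.finditer(r"https?://([A-Za-z0-9.-]+)", line):
            hosts[m.group(1).lower()] += 1

# Show the 20 most frequently mentioned hosts
for host, n in hosts.most_common(20):
    print(f"{n:6d}  {host}")

If anything other than www.sheldonbrown.com dominates that list, the
mirror was wandering offsite.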

Set Options - Limits - Maximum Mirror Depth - (blank)
Set Options - Limits - Maximum External Depth - (blank)

If you punch the "help" button on the "Limits" tab page, you get:

Maximum mirror depth
Define how deep will the engine seek in the site. A depth of 3 means
that you will catch all pages you have indicated, plus all that can be
accessed clicking twice on any link
Note: This option is not filled by default, so the depth is infinite.
But because the engine will stay on the site you indicated, only the
desired sites will be mirrored, and not all the web!

Maximum external depth
Define how deep will the engine seek in external sites, or on
addresses that were forbidden.
Normally, HTTrack will not go on external sites by default (except if
authorized by filters), and will avoid addresses forbidden by filters.
You can override this behavior, and tell the engine to catch N levels
of "external" sites.
Note: Use this option with great care, as it is overriding all other
options (filters and default engine limiter)
Note: This option is not filled by default, so the depth is equal to
zero.

I left both settings blank (default) and it worked. To be honest, I
don't really understand the difference between these two settings and
how they work. The error message seems to indicate a problem with one
or both settings. The help button shows the default settings for each
tab and setting.
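My best-guess reading of the two settings, as a toy Python sketch.
This is NOT HTTrack's actual logic, just how I interpret the help
text; the function and variable names are mine:

from urllib.parse import urlparse

def in_scope(url, start_host, depth, ext_depth, max_depth, max_ext):
    """Decide whether a link found 'depth' clicks from the start page,
    already 'ext_depth' levels into external sites, should be fetched.
    max_depth=None and max_ext=0 mimic the blank defaults."""
    external = urlparse(url).hostname != start_host
    if max_depth is not None and depth > max_depth:
        return False          # too many clicks from the start page
    if external and ext_depth >= max_ext:
        return False          # don't wander (further) offsite
    return True

start = "www.sheldonbrown.com"
# On-site page, defaults: fetched
print(in_scope("https://www.sheldonbrown.com/gears.html", start, 2, 0, None, 0))
# Offsite page, defaults (external depth 0): skipped
print(in_scope("https://example.com/whatever", start, 2, 0, None, 0))

So with both boxes blank, depth on the site itself is unlimited, but
anything offsite is cut off immediately.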

I changed only the following from their defaults:
Flow Control - Number of Connections - 8
Flow Control - Persistent Connections - [x]
Links - Get non-HTML files etc [x]
Everything else was set to default values.
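For what it's worth, here's roughly what those two Flow Control
settings buy you, sketched with Python's requests module and a thread
pool rather than HTTrack (the URL list is a placeholder; requests
isn't in the standard library, so install it first):

from concurrent.futures import ThreadPoolExecutor
import requests

urls = [
    "https://www.sheldonbrown.com/",
    # ... whatever link list the crawler has discovered
]

# One Session reuses TCP connections (keep-alive), like
# "Persistent Connections".  Sessions aren't formally thread-safe,
# but for a sketch this illustrates the idea.
session = requests.Session()

def fetch(url):
    r = session.get(url, timeout=30)
    r.raise_for_status()
    return url, len(r.content)

# 8 workers, like "Number of Connections - 8"
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, size in pool.map(fetch, urls):
        print(size, url)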

To keep HTTrack on the site instead of wandering all over the
internet, I noticed that the start URL had to match the site actually
being mirrored and could NOT rely on redirection from http to https.
On the Project page, make sure it's https and www, as in:
https://www.sheldonbrown.com
That might also have been the problem.
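If you want to see where the other spellings of the URL end up, a few
lines of Python with the requests module will follow the redirects
(the URLs are just the obvious variants):

import requests

for url in ("http://sheldonbrown.com",
            "http://www.sheldonbrown.com",
            "https://sheldonbrown.com"):
    # GET rather than HEAD, since some servers answer HEAD oddly
    r = requests.get(url, allow_redirects=True, timeout=30)
    print(f"{url}  ->  {r.url}  ({r.status_code})")

Whatever final URL that prints is the one to paste into the Project
page.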

There's no robots.txt file, so you can ignore that setting.
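You can double-check the robots.txt situation for yourself with a
couple of lines (a 404 status means there's nothing there):

import requests

r = requests.get("https://www.sheldonbrown.com/robots.txt", timeout=30)
print(r.status_code)
print(r.text[:500] if r.ok else "(nothing served at that URL)")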

I noticed that HTTrack adds a comment to each page indicating that the
file had been downloaded with HTTrack. I probably should have turned
it off with:
Browser ID - HTML footer - (blank)

I'm not sure whether it's easier to try to clean up what you have
already downloaded or just start over. Offhand, starting over might be
easier and more reliable.
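If you do decide to clean up in place instead, something like this
(untested -- back up the mirror first) should strip HTTrack's
"Mirrored from ... by HTTrack ..." comments from the saved pages.
The exact comment wording varies by version, so the pattern is
deliberately loose, and the folder path is whatever your project uses:

import re
from pathlib import Path

MIRROR = Path(r"C:\My Web Sites\sheldonbrown")   # assumed project folder

# Loose match for the HTTrack footer comment
pattern = re.compile(r"<!--\s*Mirrored from .*?HTTrack.*?-->\s*",
                     re.IGNORECASE | re.DOTALL)

for page in MIRROR.rglob("*.htm*"):
    text = page.read_text(encoding="utf-8", errors="replace")
    cleaned = pattern.sub("", text)
    if cleaned != text:
        page.write_text(cleaned, encoding="utf-8")
        print("cleaned", page)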

Good luck.
--
Jeff Liebermann
PO Box 272
http://www.LearnByDestroying.com
Ben Lomond CA 95005-0272
Skype: JeffLiebermann AE6KS 831-336-2558