FETCHSITE - A recursive web mirroring tool
==========================================
Fetchsite can be used like wget to mirror web trees for static
viewing. Fetchsite's primary design goal was to create a tool that
generates static versions of dynamic Zope websites. As such, fetchsite
is considered a Zope developer's tool.
Fetchsite has a number of special features that distinguish it from
comparable tools. Fetchsite
1. recursively fetches a complete site and saves all documents to
   normalized paths
2. intelligently processes directory default views (aka index.html)
3. interprets CSS and JavaScript code
4. correctly processes the tag and HTML redirects
5. supports file redirection
6. is XHTML-clean
Fetchsite also has a number of shortcomings that make it suboptimal
as a general wget replacement:
1. It only works on valid HTML pages. Fetchsite will break,
   sometimes horribly, if confronted with incorrect HTML.
2. Fetchsite is specialized for use with Zope: it recognizes
   index_html as the index document instead of index.
3. It chokes on XML directives ( ... ?> constructs).
4. It has more bugs than wget :-)
Several features are on the TODO list:
o parse bobo_exception headers
o remove duplicate files on redirects using the rewrite engine
o better JavaScript parsing
o somehow handle XML directives
o better error handling
o make filename normalization configurable
1. Pathname normalization
-------------------------
Fetchsite stores all downloaded documents in a directory hierarchy
mirroring the structure of the website. All destination paths are
normalized in the following manner:
o If a file redirection applies to the source URL, it is applied
o If no file redirection applies, every path component is processed
in the following way:
o The component is folded to lowercase and truncated to 8.3
characters
o The component is uniquified by appending letters or changing
the last letters of the name until it is unique
o The file extension is adjusted according to the content-type of
the document
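
The normalization steps above can be sketched in Python roughly as
follows. This is only an illustration, not fetchsite's actual code:
the names truncate_8_3, uniquify, adjust_extension, normalize and the
EXTENSIONS table are hypothetical, file redirection is left out, and
the extension is assumed to be adjusted only for the last path
component.

```python
import itertools
import posixpath
import string

# Hypothetical content-type table; the real mapping is not documented here.
EXTENSIONS = {"text/html": ".html", "text/css": ".css", "image/png": ".png"}


def truncate_8_3(component):
    """Fold to lowercase and cut to at most 8 name and 3 extension characters."""
    name, ext = posixpath.splitext(component.lower())
    return name[:8] + ext[:4]  # ext still carries its leading dot


def uniquify(component, taken):
    """Vary the trailing letters of the name until it is not in `taken`."""
    if component not in taken:
        return component
    name, ext = posixpath.splitext(component)
    for length in (1, 2, 3):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            # Append while there is room, otherwise overwrite the last letters.
            base = name if len(name) + length <= 8 else name[:8 - length]
            candidate = base + "".join(letters) + ext
            if candidate not in taken:
                return candidate
    raise ValueError("no unique name found for %r" % component)


def adjust_extension(component, content_type):
    """Replace the extension by the one matching the content-type, if known."""
    name, ext = posixpath.splitext(component)
    return name + EXTENSIONS.get(content_type, ext)


def normalize(path, content_type, seen, taken):
    """Normalize a URL path component by component.

    `seen` maps (parent, original component) to the normalized component so
    that repeated lookups are stable; `taken` maps a parent directory to the
    set of names already used inside it.
    """
    parts = [p for p in path.split("/") if p]
    result = []
    for i, part in enumerate(parts):
        parent = "/".join(result)
        key = (parent, part)
        if key not in seen:
            short = truncate_8_3(part)
            if i == len(parts) - 1:
                short = adjust_extension(short, content_type)
            used = taken.setdefault(parent, set())
            short = uniquify(short, used)
            used.add(short)
            seen[key] = short
        result.append(seen[key])
    return "/".join(result)
```

Under these assumptions, normalizing /Products/LongDocumentName and
then /Products/LongDocumentNameTwo (both text/html) with shared seen
and taken dictionaries yields products/longdocu.html and
products/longdoca.html.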
2. Directory default views
--------------------------
If a name collision between a document and a directory is encountered
during pathname normalization, we assume that the document is the
directory's default view and save it as the index document in that
directory. For example, when downloading
http://no.where.org/this/is/a/test
http://no.where.org/this/is/a/test/again
fetchsite will find a document <-> directory collision at
'this/is/a/test' and will therefore store:
http://no.where.org/this/is/a/test -> this/is/a/test/index.html
http://no.where.org/this/is/a/test/again -> this/is/a/test/again.html
(assuming both are HTML documents)
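
The collision rule can be illustrated with a small Python sketch. It
is a simplification under stated assumptions, not fetchsite's
implementation: the function name store_paths is hypothetical, all
documents are taken to be HTML, and all paths are assumed to be known
in advance, whereas a crawler discovers them one at a time.

```python
def store_paths(doc_paths, ext=".html", index_name="index"):
    """Map each document path to the file it would be stored in."""
    docs = [p.strip("/") for p in doc_paths]

    # Every proper prefix of a document path must exist as a directory.
    directories = set()
    for doc in docs:
        parts = doc.split("/")
        for i in range(1, len(parts)):
            directories.add("/".join(parts[:i]))

    stored = {}
    for doc in docs:
        if doc in directories:
            # Document <-> directory collision: the document becomes the
            # default view (index document) of that directory.
            stored[doc] = doc + "/" + index_name + ext
        else:
            stored[doc] = doc + ext
    return stored
```

Called with the two paths from the example, this/is/a/test and
this/is/a/test/again, the sketch maps them to
this/is/a/test/index.html and this/is/a/test/again.html.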
3. CSS and JavaScript parsing
-----------------------------
Fetchsite employs very simple CSS and JavaScript parsing using
regular expressions. In CSS this is not really a problem, since CSS
very seldom contains regions of plain text. In JavaScript, this
approach might incorrectly scramble the code.
Therefore, JavaScript statements are *only* processed in