FETCHSITE - A recursive web mirroring tool
==========================================

Fetchsite can be used like wget to mirror web trees for static viewing.
Fetchsite's primary design goal was to provide a tool that generates static
versions of dynamic Zope websites. As such, fetchsite is considered a Zope
developer's tool.

Fetchsite has a number of special features which distinguish it from
comparable tools. Fetchsite

  1. recursively fetches a complete site and saves all documents to
     normalized paths
  2. intelligently processes directory default views (aka index.html)
  3. interprets CSS and JavaScript code
  4. correctly processes the tag and HTML redirects
  5. supports file redirection
  6. is XHTML-clean

Fetchsite also has a number of shortcomings which make it suboptimal as a
general wget replacement:

  1. It only works on valid HTML pages. Fetchsite will break, sometimes
     horribly, if confronted with incorrect HTML.
  2. Fetchsite is specialized for use with Zope: it recognizes index_html
     as the index document instead of index.
  3. It chokes on XML directives ( constructs).
  4. It has more bugs than wget :-)

Several features are on the TODO list:

  o parse bobo_exception headers
  o remove duplicate files on redirects using the rewrite engine
  o better JavaScript parsing
  o somehow handle XML directives
  o better error handling
  o make filename normalization configurable

1. Pathname normalization
-------------------------

Fetchsite stores all downloaded documents in a directory hierarchy
mirroring the structure of the website.
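
To illustrate the mirroring idea, here is a minimal Python sketch of how a
URL path could be mapped onto a directory hierarchy. It is not fetchsite's
actual code: the function name mirror_path and the root parameter are
hypothetical, and the normalization rules listed in this section are
deliberately not applied here.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def mirror_path(url, root="mirror"):
    """Map a URL to a relative path below `root` that mirrors the site layout.

    Hypothetical helper for illustration only; `root` is an assumed
    destination directory, not a fetchsite option.
    """
    path = urlsplit(url).path
    # Split the URL path into components, dropping empty ones (leading,
    # trailing or doubled slashes) and the special names "." and "..".
    components = [
        unquote(part)
        for part in path.split("/")
        if part and part not in (".", "..")
    ]
    # The host name is not part of the result, matching the examples in
    # this document, where only the URL path appears on the right-hand side.
    return PurePosixPath(root, *components)


if __name__ == "__main__":
    print(mirror_path("http://no.where.org/this/is/a/test"))
```

The sketch only shows the one-to-one mapping between URL path components
and directories; the real destination names additionally depend on the
normalization steps of this section.
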
All destination paths are normalized in the following manner:

  o If a file redirection applies to the source URL, it is applied.
  o If no file redirection applies, every path component is processed in
    the following way:

      o The component is folded to lowercase and truncated to 8.3
        characters.
      o The component is made unique by appending letters or changing the
        last letters of the name until it is unique.
      o The file extension is adjusted according to the content type of
        the document.

2. Directory default views
--------------------------

If a name collision between a document and a directory is encountered
during pathname normalization, fetchsite assumes that the document is the
directory's default view and saves it as the index document in that
directory. For example, when downloading

  http://no.where.org/this/is/a/test
  http://no.where.org/this/is/a/test/again

fetchsite will find a document <-> directory collision at 'this/is/a/test'
and will therefore store:

  http://no.where.org/this/is/a/test       -> this/is/a/test/index.html
  http://no.where.org/this/is/a/test/again -> this/is/a/test/again.html

(assuming both are HTML documents)

3. CSS and JavaScript parsing
-----------------------------

Fetchsite employs very simple CSS and JavaScript parsing using regular
expressions. In CSS this is not really a problem, since CSS very seldom
contains regions of plain text. In JavaScript, this approach might
incorrectly scramble JavaScript code. Therefore, JavaScript statements are
*only* processed in