martin-sucha/sitetostatic

sitetostatic

Scrape a website so that it can be served from static files instead.

This is similar to httrack and wget -m, but neither tool did exactly what I wanted, so there are a few differences. You might still want to consider these alternatives for your use case.

I wanted to preserve original responses, including headers.

How to scrape

sitetostatic scrape --allow-root http://example.com/ repository-path http://example.com/
sitetostatic files repository-path output-path

A similar result can be achieved with wget:

cd output-path
wget -mpE http://example.com

but this does not preserve the response headers anywhere.

Alternatively, you could use httrack to generate the files to serve:

httrack http://example.com/ -O output-path,repository-path -%v -k -%p -d -%q

With -k, this stores all the files in the cache (repository-path).

However, the files in output-path don't keep their original file extensions. For example, if the URL ends in file.aspx and the response contains HTML, httrack outputs file.html, whereas sitetostatic files outputs file.aspx.html.
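The naming difference described above can be sketched in a few lines of Python. This is only an illustration and not sitetostatic's actual implementation. The `output_name` function, the `EXTENSION_FOR_TYPE` table, and the `index` fallback for directory URLs are all hypothetical. Only the `file.aspx` to `file.aspx.html` behavior comes from this README.

```python
import posixpath

# Hypothetical table: media type -> extension appended to the stored file name.
EXTENSION_FOR_TYPE = {
    "text/html": ".html",
    "text/css": ".css",
    "application/javascript": ".js",
}


def output_name(url_path: str, content_type: str) -> str:
    """Keep the original name and append an extension matching the content type."""
    # Drop parameters such as "; charset=utf-8" from the Content-Type value.
    media_type = content_type.split(";", 1)[0].strip().lower()
    wanted = EXTENSION_FOR_TYPE.get(media_type)

    # Assumption: a URL path ending in "/" is stored under the name "index".
    name = posixpath.basename(url_path) or "index"

    # Unknown type, or the name already ends with the right extension: keep as is.
    if wanted is None or name.lower().endswith(wanted):
        return name
    return name + wanted


print(output_name("/file.aspx", "text/html; charset=utf-8"))
```

Appending rather than replacing the extension keeps the original URL recognizable in the file name while still letting a static file server pick the content type from the final extension.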

The -%q (--include-query-string) httrack option doesn't seem to work for me: the query string is not included in the filename.

Verifying that you are serving the same data

There is a sitetostatic diff command to compare two repositories of scraped data (or httrack caches). This is useful when you want to verify that the new web server returns the same data as the old site. Just scrape the new site as well and run sitetostatic diff on the two repositories.
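To make the idea of comparing two scrapes concrete, here is a minimal Python sketch of the kind of check involved. It does not reflect how sitetostatic diff is implemented or what it reports. The `Response` class, the `diff_scrapes` function, and the choice to ignore the `Date` header are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical record of one scraped response."""
    status: int
    headers: dict
    body: bytes


def _normalize(headers, ignored):
    # Header names are case-insensitive, so compare them in lower case.
    return {k.lower(): v for k, v in headers.items() if k.lower() not in ignored}


def diff_scrapes(old, new, ignored_headers=frozenset({"date"})):
    """Compare two {url: Response} mappings and return a list of (url, reason)."""
    differences = []
    for url in sorted(old.keys() | new.keys()):
        if url not in new:
            differences.append((url, "missing in new"))
            continue
        if url not in old:
            differences.append((url, "missing in old"))
            continue
        a, b = old[url], new[url]
        if a.status != b.status:
            differences.append((url, "status differs"))
        if _normalize(a.headers, ignored_headers) != _normalize(b.headers, ignored_headers):
            differences.append((url, "headers differ"))
        if a.body != b.body:
            differences.append((url, "body differs"))
    return differences


old_site = {"/": Response(200, {"Content-Type": "text/html"}, b"<p>hi</p>")}
new_site = {"/": Response(200, {"content-type": "text/html"}, b"<p>hi</p>")}
print(diff_scrapes(old_site, new_site))
```

Headers that legitimately change between requests, such as `Date`, would otherwise flag every URL as different, which is why the sketch excludes them by default.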
