Release v0.9.0 · VIDA-NYU/ache

We are pleased to announce version 0.9.0 of ACHE Focused Crawler! We also recently reached the milestone of 100+ stars on GitHub, 55+ forks, and 1000+ commits in the current git repository. We would like to thank all users for the feedback we have received in the past year.

This is a large release that brings many improvements to the documentation and several new features. The following is a detailed log of major changes since the last version:

Fixed multiple bugs and handling of exceptions
Several improvements made to ACHE documentation
Allow use of multiple data formats simultaneously (issue #92)
Added new data storage format using the standard WARC format (issue #64)
Added new data storage format using Apache Kafka (issue #123)
Re-crawling of sitemap.xml files at fixed time intervals (issue #73)
Allow configuration of cookies in ache.yml (issue #81)
Allow configuration of full User-Agent string
Fixed memory issues that would cause OutOfMemoryError (issue #63)
Support for robots exclusion protocol a.k.a. robots.txt (issue #46)
Added new HTTP fetcher implementation using the okhttp3 library with support for multiple SSL cipher suites
Non-HTML pages are no longer parsed as HTML
Training of new link classifiers (Online Learning) in a background thread (issue #76)
Added REST API endpoint to stop crawler
Added REST API endpoint to add new seeds to the crawl
Added documentation for the REST API
Persist run-time crawl metrics across crawler restarts (issue #101)
Added support for per-domain wildcard link filters (issue #121)
Added more detailed metrics for HTTP response codes (issue #120)
Changed referrer policies in the search dashboard for better security
Added various configuration options for timeouts in both fetcher implementations (issue #122)
Added support for Basic HTTP authentication in the web interface (issue #129)
Added REST API endpoints to support monitoring using Prometheus.io (issue #128)
Added page relevance metrics for better monitoring (issue #119)
Added parameters for Elasticsearch index and type names through the /startCrawl REST API (issue #107)
Support for serving web interface from non-root path (issue #137)
Added button to stop crawler in web user interface (issue #139)
Upgraded searchkit library to 2.2.0 which supports Elasticsearch 5.x
Upgraded crawler-commons library to version 0.8

Note that there were breaking changes in some data formats:

Repositories for relevant and irrelevant pages are now stored in the same folder (or same Elasticsearch index) and page entries include new properties to identify pages as relevant or irrelevant according to the target page classifier output. Double check the data formats documentation page and make sure you make appropriate changes if needed.
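
To illustrate what this change can mean for downstream code, here is a minimal sketch that splits page entries back into relevant and irrelevant groups after reading them from a single repository. It assumes one JSON object per line and a boolean property named `isRelevant`; both the layout and the property name are assumptions made for illustration, so check the data formats documentation for the actual names before relying on it.

```python
import json


def split_by_relevance(lines, flag="isRelevant"):
    """Partition JSON page entries using a relevance flag.

    `flag` is a hypothetical property name; replace it with the one
    described in the data formats documentation.
    """
    relevant, irrelevant = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page = json.loads(line)
        # Entries that lack the flag are treated as irrelevant here.
        if page.get(flag):
            relevant.append(page)
        else:
            irrelevant.append(page)
    return relevant, irrelevant


# Made-up sample entries, not real crawler output.
sample = [
    '{"url": "http://example.com/a", "isRelevant": true}',
    '{"url": "http://example.com/b", "isRelevant": false}',
]
relevant, irrelevant = split_by_relevance(sample)
print(len(relevant), len(irrelevant))
```

Code that previously read two separate folders or indices would instead read one source and filter on the flag, as above.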
