HappyCrawl

What's this:

An easy-to-use framework that lets you focus on the information extraction process instead of on how to fetch web pages. Web information mining is often tedious because the mining logic is unclear; in this framework, roles are defined to keep the extraction process clear and clean.

The whole process resembles MapReduce.

There are three roles in this framework.

  1. A Pager is responsible for storing a page and extracting its information locally into key-value pairs. A Pager only needs to be initialized with a URI.

  2. A Fetcher fetches a collection of Pagers based on their URIs. Because they are URIs, the pages can live either on the local disk or on the web. The fetching process is multi-threaded.

  3. A Combiner scans the list of fetched Pagers and combines all the key-value pairs however you like.

Powered by Jsoup, so you can use CSS selectors in your code to select elements. A minimal sketch of how the three roles might fit together is shown below.
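The README does not show the framework's actual class or method names, so the sketch below uses hypothetical interfaces (Pager.convert(), Fetcher.fetch(), Combiner.combine()) together with a real Jsoup CSS selector to illustrate the roles; treat it as an illustration of the idea, not the real HappyCrawl API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical role interfaces -- names and signatures are illustrative only.
interface Pager {
    // Parse the page behind the URI and emit key-value pairs.
    Map<String, String> convert() throws Exception;
}

interface Combiner {
    // Merge the key-value pairs produced by a list of Pagers.
    Map<String, String> combine(List<Map<String, String>> results);
}

// A simple web-backed Pager that extracts elements with a Jsoup CSS selector.
class TitlePager implements Pager {
    private final String uri;
    TitlePager(String uri) { this.uri = uri; }

    @Override
    public Map<String, String> convert() throws Exception {
        Document doc = Jsoup.connect(uri).get();
        Elements titles = doc.select("h1");               // CSS selector via Jsoup
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < titles.size(); i++) {
            out.put(String.valueOf(i), titles.get(i).text());   // key = index of occurrence
        }
        return out;
    }
}

// A Fetcher that runs every Pager on a thread pool (the multi-threaded part).
class Fetcher {
    List<Map<String, String>> fetch(List<Pager> pagers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Map<String, String>>> futures = new ArrayList<>();
        for (Pager p : pagers) {
            futures.add(pool.submit(p::convert));
        }
        List<Map<String, String>> results = new ArrayList<>();
        for (Future<Map<String, String>> f : futures) {
            results.add(f.get());
        }
        pool.shutdown();
        return results;
    }
}
```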

Demo: Here is a page containing a list of URLs that point to some interesting articles. I want to start with this page and collect all the articles into one HTML file, which can then be converted into a PDF or something else. I have already downloaded the index page to my local disk.

  1. I derive a FilePager constructed with the path to the file. In its overridden convert() method, I select all the <div> elements with a specific selector and output them, keyed by the index of their occurrence.

  2. Then I implement a Combiner to merge the previous output in order of occurrence. (This Combiner is not strictly necessary; you can also handle the output directly in the main function.)

  3. Using the combined info, I instantiate many new Pagers, each responsible for extracting only the <div> that contains the article on its page.

  4. I implement a new Combiner to collect all the <div> elements into one HTML file. (A condensed sketch of this workflow follows the list.)
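The demo code itself is not included in this README, so here is a condensed, stand-alone sketch of the same workflow written against plain Jsoup rather than the framework's own FilePager/Combiner classes. The file name index.html and the selectors div.article-list a[href] and div.article-body are placeholders for whatever the real pages use, and the sketch assumes the links on the index page are absolute URLs.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class ArticleDemo {
    public static void main(String[] args) throws Exception {
        // Steps 1-2: parse the locally saved index page and collect the article
        // links, keyed by their order of occurrence.
        Document index = Jsoup.parse(new File("index.html"), "UTF-8");
        Map<Integer, String> links = new LinkedHashMap<>();
        int i = 0;
        for (Element a : index.select("div.article-list a[href]")) {   // placeholder selector
            links.put(i++, a.attr("href"));                            // assumes absolute URLs
        }

        // Steps 3-4: fetch each article page, keep only the <div> holding the
        // article body, and append it to one combined HTML document.
        StringBuilder combined = new StringBuilder("<html><body>\n");
        for (String url : links.values()) {
            Document page = Jsoup.connect(url).get();
            Element body = page.select("div.article-body").first();    // placeholder selector
            if (body != null) {
                combined.append(body.outerHtml()).append('\n');
            }
        }
        combined.append("</body></html>\n");
        Files.write(Paths.get("articles.html"),
                combined.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```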

TODO:

  1. Use Maven to manage JARs;

  2. Add a local cache for HttpParser, so that the program need not re-crawl the web every time it is being debugged;

  3. Add common Pagers and Combiners.
