From cb690af999e9619f77adeef1200bbfdc0e1de07e Mon Sep 17 00:00:00 2001
From: John Beieler
Date: Mon, 5 Sep 2016 21:34:25 -0400
Subject: [PATCH] Update README.

---
 README.md | 51 +++++++++++++++++++++++++++++++++------------------
 1 file changed, 33 insertions(+), 18 deletions(-)

diff --git a/README.md b/README.md
index af5b1fe..34df1c2 100644
--- a/README.md
+++ b/README.md
@@ -3,26 +3,41 @@ atlas
 
 Distributed web scraper for political news content.
 
-##Use
-
-Spawn a few worker processes either in a new shell or using something like
-supervisor:
+In short, the program pulls news links from RSS feeds, checks whether they've
+been scraped yet, sends each URL to a worker queue, and spawns worker
+processes that scrape the pages pulled from that queue.
+
+##What's new in v2
+
+The new version of `atlas` is based on Docker and `docker-compose`. Each of
+the processes, the page extractor and the RSS extractor, runs in its own
+Docker container. Through the use of `docker-compose`, all of the
+dependencies are installed and linked to the scraping components. The IP
+information for the dependencies is passed through command-line arguments,
+however, so it can be modified as needed.
+
+**But why Docker?**
+
+There are pros and cons to using Docker and `docker-compose` to deploy and
+manage `atlas`. The cons mainly relate to the fairly rigid structure that
+`docker-compose` imposes on the linkages between the pieces. Additionally,
+some parts of the extractors are hardcoded based on the assumption of Docker
+and `docker-compose`. It's possible to modify all of these things, however,
+and a relatively sophisticated end user should be able to get the pieces up
+and running in whatever configuration they wish; in those scenarios the
+Docker configuration provides a decent template for getting started. All of
+this is outweighed by the main pro of the Docker setup: deploying and
+managing all of the dependencies is *much* easier. `docker-compose` also
+makes scaling the various pieces relatively easy.
 
-```
-python pages.py
-```
-
-Then spawn a single process of the main script:
+##Use
 
-```
-python rss.py
-```
+Basic usage:
 
-And let it rip.
+`docker-compose up -d`
 
-##Other Notes
+`docker-compose stop`
 
-If you're using supervisor, which you should be, you should write the stdout of
-the worker and primary processes to log files. There's also a log file in the
-`atlas` directory that picks up the logging messages that are scattered
-throughtout the code, such as when a page doesn't return any results.
+More advanced users should read the various guides to Docker and
+`docker-compose` to determine how best to set up the program for their
+specific needs.
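
For readers piecing together the v2 setup this patch describes, a minimal `docker-compose.yml` along the following lines illustrates the layout the new README implies: one container per extractor, linked to the backing services, with the dependencies' addresses passed as command-line arguments. This is a sketch, not the repository's actual file; the service names, images, and flags (`--redis_ip`, `--mongo_ip`) are assumptions, as is the choice of Redis for the worker queue and MongoDB for tracking already-scraped URLs.

```yaml
# Hypothetical docker-compose.yml (v1 syntax, circa 2016).
# Service names, images, and CLI flags are illustrative assumptions,
# not the repository's actual configuration.
redis:
  image: redis:3.0        # assumed backing store for the worker queue

mongo:
  image: mongo:3.2        # assumed store for already-scraped URLs

rss:
  build: .
  command: python rss.py --redis_ip redis --mongo_ip mongo
  links:                  # compose makes the linked services resolvable by name
    - redis
    - mongo

pages:
  build: .
  command: python pages.py --redis_ip redis --mongo_ip mongo
  links:
    - redis
    - mongo
```

With a file like this in place, `docker-compose up -d` starts the whole stack, and a command such as `docker-compose scale pages=4` (the compose-v1-era scaling syntax) runs additional page-extractor workers, which is the kind of scaling the README alludes to.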