roundup + version bump
adbar committed Mar 19, 2020
1 parent 0268e9a commit 1d83a7e
Showing 5 changed files with 10 additions and 7 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,8 @@
 ## Changelog

+### 0.6.2
+- performance and documentation improved
+
 ### 0.6.1
 - code base restructured
 - bugs fixed and further tests
4 changes: 2 additions & 2 deletions README.rst
@@ -68,7 +68,7 @@ On the command-line:
 Features
 --------

-*htmldate* finds original and updated publication dates of web pages. URLs, HTML files or HTML trees are given as input, the library outputs a date string in the desired format. It provides following ways to date a HTML document:
+*htmldate* finds original and updated publication dates of web pages using heuristics on HTML code and linguistic patterns. URLs, HTML files or HTML trees are given as input, the library outputs a date string in the desired format. It provides following ways to date a HTML document:

 1. **Markup in header**: common patterns are used to identify relevant elements (e.g. ``link`` and ``meta`` elements) including `Open Graph protocol <http://ogp.me/>`_ attributes and a large number of CMS idiosyncracies
 2. **HTML code**: The whole document is then searched for structural markers: ``abbr``/``time`` elements and a series of attributes (e.g. ``postmetadata``)
@@ -252,5 +252,5 @@ Feel free to file bug reports on the `issues page <https://github.com/adbar/html

 Kudos to the following software libraries:

-- `cchardet <https://github.com/PyYoshi/cChardet>`_, `ciso8601 <https://github.com/closeio/ciso8601>`_, `lxml <http://lxml.de/>`_, `dateparser <https://github.com/scrapinghub/dateparser>`_
+- `ciso8601 <https://github.com/closeio/ciso8601>`_, `lxml <http://lxml.de/>`_, `dateparser <https://github.com/scrapinghub/dateparser>`_
 - A few patterns are derived from `python-goose <https://github.com/grangier/python-goose>`_, `metascraper <https://github.com/ianstormtaylor/metascraper>`_, `newspaper <https://github.com/codelucas/newspaper>`_ and `articleDateExtractor <https://github.com/Webhose/article-date-extractor>`_. This module extends their coverage and robustness significantly.
2 changes: 1 addition & 1 deletion htmldate/__init__.py
@@ -7,7 +7,7 @@
 __author__ = 'Adrien Barbaresi'
 __license__ = 'GNU GPL v3'
 __copyright__ = 'Copyright 2017-2020, Adrien Barbaresi'
-__version__ = '0.6.1'
+__version__ = '0.6.2'


 import logging
4 changes: 2 additions & 2 deletions htmldate/extractors.py
@@ -1,8 +1,8 @@
 # pylint:disable-msg=E0611,I1101
 """
-Custom parsers and X-Path expressions for date extraction
+Custom parsers and XPath expressions for date extraction
 """
-## This file is available from https://github.com/adbar/trafilatura
+## This file is available from https://github.com/adbar/htmldate
 ## under GNU GPL v3 license

 # standard
4 changes: 2 additions & 2 deletions setup.py
@@ -30,8 +30,8 @@ def readme():

 setup(
 name='htmldate',
-version='0.6.1',
-description='Find the creation date of web pages using a combination of tree traversal, common structural patterns, text-based heuristics and robust date extraction.',
+version='0.6.2',
+description='Fast and robust extraction of original and updated publication dates from web pages.',
 long_description=readme(),
 classifiers=[
 # As from http://pypi.python.org/pypi?%3Aaction=list_classifiers