WikimediaDumpExtractor

WikimediaDumpExtractor extracts pages from Wikimedia/Wikipedia database backup dumps.

Usage

Usage: java -jar WikimediaDumpExtractor.jar
 pages      <input XML file> <output directory> <categories> <search terms> <ids>
 categories <input SQL file> <output directory> [minimum category size, default 10000]
The values <categories> and <search terms> can contain multiple entries separated by '|'
Website: https://github.com/EML4U/WikimediaDumpExtractor
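
For example, multiple categories and search terms can be combined with '|'. In the following sketch, the additional category "Sociology" and search term "cooperation" are illustrative values only:

java -jar WikimediaDumpExtractor.jar pages enwiki-20080103-pages-articles-example.xml ./ "Social philosophy|Sociology" "altruism|cooperation" ""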

Example

Download the example XML file enwiki-20080103-pages-articles-example.xml. It contains 4 pages extracted from the enwiki 20080103 dump. Then run the following command:

java -jar WikimediaDumpExtractor.jar pages enwiki-20080103-pages-articles-example.xml ./ "Social philosophy" altruism ""

Afterwards, files similar to the example results will be created.
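
The categories mode works analogously. Here is a sketch in which the SQL file name is only an assumption for illustration; the last argument is the optional minimum category size (default 10000):

java -jar WikimediaDumpExtractor.jar categories enwiki-20080103-category.sql ./ 10000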

Process large files

To process large XML files (e.g. the enwiki 20080103 dump is 15 GB and enwiki 20210901 is 85 GB), set the following three JVM system properties:

java -DentityExpansionLimit=0 -DtotalEntitySizeLimit=0 -Djdk.xml.totalEntitySizeLimit=0 -jar WikimediaDumpExtractor.jar ...
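
For instance, combined with the example call from above, the complete command is:

java -DentityExpansionLimit=0 -DtotalEntitySizeLimit=0 -Djdk.xml.totalEntitySizeLimit=0 -jar WikimediaDumpExtractor.jar pages enwiki-20080103-pages-articles-example.xml ./ "Social philosophy" altruism ""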

How to get data

Get Wikimedia dumps from https://dumps.wikimedia.org/.
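
As a rough sketch (the dump date and exact file names below are assumptions, and whether the categories mode expects the category table SQL dump has not been verified here), downloading and unpacking English Wikipedia dump files could look like this:

# download a pages-articles XML dump and a category SQL dump (file names are assumptions)
wget https://dumps.wikimedia.org/enwiki/20210901/enwiki-20210901-pages-articles.xml.bz2
wget https://dumps.wikimedia.org/enwiki/20210901/enwiki-20210901-category.sql.gz
# decompress the files, assuming the tool expects plain XML/SQL input
bunzip2 enwiki-20210901-pages-articles.xml.bz2
gunzip enwiki-20210901-category.sql.gz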

Credits

Data Science Group (DICE) at Paderborn University

This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under grant no. 01IS19080B.