WikimediaDumpExtractor

WikimediaDumpExtractor extracts pages from Wikimedia/Wikipedia database backup dumps.

Usage

Usage: java -jar WikimediaDumpExtractor.jar
 pages      <input XML file> <output directory> <categories> <search terms> <ids>
 categories <input SQL file> <output directory> [minimum category size, default 10000]
The values <categories> and <search terms> can contain multiple entries separated by '|'
Website: https://github.com/EML4U/WikimediaDumpExtractor
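
For example, multiple categories and search terms can be combined with '|'. In the following sketch, the additional category "Sociology" and search term "cooperation" are illustrative values only:

java -jar WikimediaDumpExtractor.jar pages enwiki-20080103-pages-articles-example.xml ./ "Social philosophy|Sociology" "altruism|cooperation" ""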

Example

Download the example XML file enwiki-20080103-pages-articles-example.xml. It contains 4 pages extracted from the enwiki 20080103 dump. Then run the following command:

java -jar WikimediaDumpExtractor.jar pages enwiki-20080103-pages-articles-example.xml ./ "Social philosophy" altruism ""

Afterwards, files similar to the example results will be created.
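
The categories mode works analogously. Here is a sketch in which the SQL file name is only an assumption for illustration; the last argument is the optional minimum category size (default 10000):

java -jar WikimediaDumpExtractor.jar categories enwiki-20080103-category.sql ./ 10000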

Process large files

To process large XML files (e.g. the enwiki 20080103 dump is 15 GB and enwiki 20210901 is 85 GB), set the following three JVM system properties:

java -DentityExpansionLimit=0 -DtotalEntitySizeLimit=0 -Djdk.xml.totalEntitySizeLimit=0 -jar WikimediaDumpExtractor.jar ...
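
For instance, combined with the example call from above, the complete command is:

java -DentityExpansionLimit=0 -DtotalEntitySizeLimit=0 -Djdk.xml.totalEntitySizeLimit=0 -jar WikimediaDumpExtractor.jar pages enwiki-20080103-pages-articles-example.xml ./ "Social philosophy" altruism ""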

How to get data

Get Wikimedia dumps from https://dumps.wikimedia.org/.
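
As a rough sketch (the dump date and exact file names below are assumptions, and whether the categories mode expects the category table SQL dump has not been verified here), downloading and unpacking English Wikipedia dump files could look like this:

# download a pages-articles XML dump and a category SQL dump (file names are assumptions)
wget https://dumps.wikimedia.org/enwiki/20210901/enwiki-20210901-pages-articles.xml.bz2
wget https://dumps.wikimedia.org/enwiki/20210901/enwiki-20210901-category.sql.gz
# decompress the files, assuming the tool expects plain XML/SQL input
bunzip2 enwiki-20210901-pages-articles.xml.bz2
gunzip enwiki-20210901-category.sql.gz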

Credits

Data Science Group (DICE) at Paderborn University

This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under grant no. 01IS19080B.