Skip to content

Wiki to Text

Adrian Wilke edited this page Feb 10, 2021 · 5 revisions

Bash script (Pandoc, sed)

#!/bin/bash

if [[ $# -ne 2 ]] ; then
    echo 'Please provide: <input directory> <output directory>'
    exit 1
fi

# Remove slash at end
INDIR=${1%/}
OUTDIR=${2%/}

# Create directory if not exists
mkdir -p $OUTDIR

for FILEPATH in $INDIR/*
do
  # Only file name
  FILE="$(basename -- $FILEPATH)"

  # Convert from wiki-markup to plain text
  pandoc -f mediawiki -t plain -o $OUTDIR/$FILE $INDIR/$FILE

  # Remove markers [1]
  sed -i 's/\[[^]]*\]//g' $OUTDIR/$FILE
  
  # Remove empty lines
  sed -i '/^[[:space:]]*$/d' $OUTDIR/$FILE
done

# https://github.com/EML4U/WikimediaDumpExtractor/wiki/Wiki-to-Text
# Data Science Group (DICE) at Paderborn University
# This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under the grant no 01IS19080B.

Notes

  • Slow: Extraction of 699,988 text files takes around 15 hours (13 seconds per text file)
  • Errors: Pandoc 2.11.4 exits sometimes because of parsing errors
Clone this wiki locally