gfinol/teragen-lithops


TeraSort dataset creator for Python

This is a Python implementation of TeraGen built on Lithops. It generates a dataset for the sort benchmark, creating the data in parallel on FaaS workers and storing it in object storage. It was inspired by the Hadoop implementation and this Spark implementation.
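For readers unfamiliar with the sort benchmark data, the sketch below illustrates the general shape of what each worker produces: fixed-size 100-byte records whose first 10 bytes act as the sort key. It is a simplified illustration, not the generator used by teragen.py; the names `make_record` and `make_partition`, the per-partition seeding, and the choice of printable range for the ASCII mode are all assumptions made for the example.

```python
import random

RECORD_SIZE = 100  # bytes per sort-benchmark record
KEY_SIZE = 10      # leading bytes of each record used as the sort key


def make_record(rng, ascii_only=False):
    """Return one 100-byte record (hypothetical helper, not the repo's code)."""
    if ascii_only:
        # Assumption: "printable" means space (32) through tilde (126).
        return bytes(rng.randint(32, 126) for _ in range(RECORD_SIZE))
    return bytes(rng.getrandbits(8) for _ in range(RECORD_SIZE))


def make_partition(partition_id, num_records, ascii_only=False):
    """Build the bytes of one partition file.

    Seeding the generator with the partition id (an assumption) makes each
    worker's output reproducible and independent of the other workers.
    """
    rng = random.Random(partition_id)
    return b"".join(make_record(rng, ascii_only) for _ in range(num_records))


if __name__ == "__main__":
    data = make_partition(partition_id=0, num_records=5, ascii_only=True)
    print(len(data))          # total bytes in the partition
    print(data[:KEY_SIZE])    # key of the first record
```

Because each partition depends only on its own id and record count, one function call per partition can be handed to a separate worker, which is the property a parallel generator relies on.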

Install dependencies

The only dependency is Lithops. You can install it using pip:

pip3 install lithops

or using the requirements.txt file:

pip3 install -r requirements.txt

Set up

You need to set up the config file for Lithops. You can find a template in the Lithops repository.

Usage

You can run the teragen.py script using the following command:

python3 teragen.py -s <size> -b <bucket> -k <key> -p <partitions> -c <config_file>

Parameters

The script takes the following parameters:

  • -s: Size of the dataset to generate, given with a unit suffix (for example 100k, 5m, 10g, 1t) or as a plain number of bytes.
  • -b: Name of the bucket in which to store the dataset.
  • -k: Key name prefix for the files created.
  • -p: Number of partition files to create. Lithops will create a worker for each partition.
  • -c: Path to the Lithops config file.
  • --ascii: Use only printable characters in the dataset. Default: False
  • --localhost: Execute the function locally using processes. Default: False
  • -h: Show the help message.
  • --unique-file: Create a single file instead of multiple files. Uses S3 multipart upload. Requires S3 as the configured Lithops storage backend. Default: False
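To make the relationship between -s and -p concrete, here is a minimal sketch of how a size string could be turned into a per-partition record count. It is not taken from teragen.py: the helper names `parse_size` and `plan_partitions` are hypothetical, the suffixes are assumed to be decimal multiples (k = 1000), and the size is assumed to be rounded down to whole 100-byte records. The script itself may interpret the suffixes differently.

```python
RECORD_SIZE = 100  # bytes per sort-benchmark record

# Assumption: suffixes are decimal multiples; the script may use 1024 instead.
UNITS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(text):
    """Convert a size such as '100k' or '2048' into a number of bytes."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def plan_partitions(total_bytes, partitions):
    """Split the dataset into per-partition record counts.

    Any remainder is spread over the first partitions so that no two
    partitions differ by more than one record.
    """
    total_records = total_bytes // RECORD_SIZE
    base, extra = divmod(total_records, partitions)
    return [base + (1 if i < extra else 0) for i in range(partitions)]


if __name__ == "__main__":
    counts = plan_partitions(parse_size("100k"), partitions=3)
    print(counts)  # record count assigned to each of the 3 workers
```

Under these assumptions, `-s 100k -p 3` corresponds to 1000 records split as evenly as possible across three workers, each writing one file under the -k prefix.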
