DO NOT MERGE - Pipeline performance test project #4154

Open · wants to merge 17 commits into base: main · Changes from 14 commits
151 changes: 151 additions & 0 deletions performance-test/.gitignore
@@ -0,0 +1,151 @@
##########################
# KEDRO PROJECT

# ignore all local configuration
conf/local/**
!conf/local/.gitkeep

# ignore potentially sensitive credentials files
conf/**/*credentials*

# ignore everything in the following folders
data/**

# except their sub-folders
!data/**/

# also keep all .gitkeep files
!.gitkeep

# keep also the example dataset
!data/01_raw/*


##########################
# Common files

# IntelliJ
.idea/
*.iml
out/
.idea_modules/

### macOS
*.DS_Store
.AppleDouble
.LSOverride
.Trashes

# Vim
*~
.*.swo
.*.swp

# emacs
*~
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
*.elc

# JIRA plugin
atlassian-ide-plugin.xml

# C extensions
*.so

### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# mkdocs documentation
/site

# mypy
.mypy_cache/
1 change: 1 addition & 0 deletions performance-test/.viz/stats.json
@@ -0,0 +1 @@
{}
19 changes: 19 additions & 0 deletions performance-test/README.md
@@ -0,0 +1,19 @@
# performance-test

Contributor comment: Maybe it's more helpful to document how this project should be used; otherwise I suggest removing this file, as the template doesn't add much information.


## Overview

This is a test project meant to simulate delays in specific parts of a Kedro pipeline. It serves as a tool to gauge pipeline performance and to compare in-development changes to Kedro against a stable release version.

## Usage

There are three delay parameters that can be set in this project:

**hook_delay** - Simulates slow-loading hooks, caused by the hook performing complex operations or accessing external services that can suffer from latency.


**dataset_load_delay** - Simulates a delay in loading a dataset, caused, for example, by a large file size or connection latency.


**file_save_delay** - Simulates a delay in saving an output file, caused, for example, by connection latency when accessing remote storage.


When invoking the `kedro run` command, you can pass the desired value in seconds for each delay as a parameter using the `--params` flag. For example:

`kedro run --params=hook_delay=5,dataset_load_delay=5,file_save_delay=5`
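As a back-of-the-envelope model of what these parameters contribute to a run, the sketch below (plain Python, no Kedro required; the function name and the per-dataset multipliers are illustrative assumptions, not code from this PR) adds up the simulated overhead:

```python
from time import monotonic, sleep

def simulated_overhead(hook_delay, dataset_load_delay, file_save_delay,
                       n_loads=1, n_saves=1):
    """Toy model (not Kedro code): sleep once at hook initialisation,
    once per dataset load, and once per file save, then report the
    total wall-clock overhead in seconds."""
    start = monotonic()
    sleep(hook_delay)                 # one-off cost at session start-up
    for _ in range(n_loads):
        sleep(dataset_load_delay)     # cost per input dataset
    for _ in range(n_saves):
        sleep(file_save_delay)        # cost per saved output
    return monotonic() - start

overhead = simulated_overhead(0.05, 0.02, 0.01, n_loads=2, n_saves=3)
```

With the real project you would instead compare `kedro run` wall-clock times between a stable Kedro release and the in-development version, using identical delay values.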
58 changes: 58 additions & 0 deletions performance-test/conf/base/catalog.yml
@@ -0,0 +1,58 @@
congress_expenses:
type: spark.SparkDataset
filepath: data/gastos-deputados.csv
file_format: csv
load_args:
header: True
inferSchema: True

expenses_per_party:
type: spark.SparkDataset
filepath: data/output/expenses_per_party.csv
file_format: csv
save_args:
sep: ','
header: True
mode: overwrite
load_args:
header: True
inferSchema: True

largest_expense_source:
type: spark.SparkDataset
filepath: data/output/largest_expense_source.parquet
file_format: parquet
save_args:
sep: ','
header: True
mode: overwrite

top_spender_per_party:
type: spark.SparkDataset
filepath: data/output/top_spender_per_party.csv
file_format: csv
save_args:
sep: ','
header: True
mode: overwrite
load_args:
header: True
inferSchema: True

top_overall_spender:
type: spark.SparkDataset
filepath: data/output/top_overall_spender.parquet
file_format: parquet
save_args:
sep: ','
header: True
mode: overwrite

top_spending_party:
type: spark.SparkDataset
filepath: data/output/top_spending_party.parquet
file_format: parquet
save_args:
sep: ','
header: True
mode: overwrite
3 changes: 3 additions & 0 deletions performance-test/conf/base/parameters.yml
@@ -0,0 +1,3 @@
hook_delay: 0
dataset_load_delay: 0
file_save_delay: 0
5 changes: 5 additions & 0 deletions performance-test/conf/base/parameters_expense_analysis.yml
@@ -0,0 +1,5 @@
# This is a boilerplate parameters config generated for pipeline 'expense_analysis'
# using Kedro 0.19.8.
#
# Documentation for this file format can be found in "Parameters"
# Link: https://docs.kedro.org/en/0.19.8/configuration/parameters.html
8 changes: 8 additions & 0 deletions performance-test/conf/base/spark.yml
@@ -0,0 +1,8 @@
# You can define Spark-specific configuration here.

spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true

# https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
Empty file.
Empty file.
43 changes: 43 additions & 0 deletions performance-test/pyproject.toml
@@ -0,0 +1,43 @@
[build-system]
requires = [ "setuptools",]
build-backend = "setuptools.build_meta"

[project]
name = "performance_test"
readme = "README.md"
dynamic = [ "dependencies", "version",]

[project.scripts]
performance-test = "performance_test.__main__:main"

[tool.kedro]
package_name = "performance_test"
project_name = "performance-test"
kedro_init_version = "0.19.8"
tools = [ "PySpark", "Linting",]
example_pipeline = "False"
source_dir = "src"

[tool.ruff]
line-length = 88
show-fixes = true
select = [ "F", "W", "E", "I", "UP", "PL", "T201",]
ignore = [ "E501",]

[project.entry-points."kedro.hooks"]

[tool.ruff.format]
docstring-code-format = true

[tool.setuptools.dynamic.dependencies]
file = "requirements.txt"

[tool.setuptools.dynamic.version]
attr = "performance_test.__version__"

[tool.setuptools.packages.find]
where = [ "src",]
namespaces = false

[tool.kedro_telemetry]
project_id = ""
11 changes: 11 additions & 0 deletions performance-test/requirements.txt
@@ -0,0 +1,11 @@
ipython>=8.10
jupyterlab>=3.0
kedro~=0.19.8
kedro-datasets>=3.0; python_version >= "3.9"
kedro-datasets>=1.0; python_version < "3.9"
kedro-viz>=6.7.0
kedro[jupyter]
notebook
ruff~=0.1.8
scikit-learn~=1.5.1; python_version >= "3.9"
scikit-learn<=1.4.0,>=1.0; python_version < "3.9"
Contributor comment: pyspark should probably be here

4 changes: 4 additions & 0 deletions performance-test/src/performance_test/__init__.py
@@ -0,0 +1,4 @@
"""performance-test
"""

__version__ = "0.1"
24 changes: 24 additions & 0 deletions performance-test/src/performance_test/__main__.py
@@ -0,0 +1,24 @@
"""performance-test file for ensuring the package is executable
as `performance-test` and `python -m performance_test`
"""
import sys
from pathlib import Path
from typing import Any

from kedro.framework.cli.utils import find_run_command
from kedro.framework.project import configure_project


def main(*args, **kwargs) -> Any:
    package_name = Path(__file__).parent.name
    configure_project(package_name)

    interactive = hasattr(sys, 'ps1')
    kwargs["standalone_mode"] = not interactive

    run = find_run_command(package_name)
    return run(*args, **kwargs)


if __name__ == "__main__":
    main()
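The `standalone_mode` toggle in `main()` keys off `sys.ps1`, which only exists inside an interactive interpreter. A minimal standalone illustration of the same check (the helper name is illustrative, not part of this PR):

```python
import sys

def resolve_standalone_mode(kwargs):
    """Mirror the toggle in main(): Click's standalone_mode is disabled
    when running inside an interactive interpreter, where sys.ps1 is set."""
    interactive = hasattr(sys, "ps1")
    kwargs["standalone_mode"] = not interactive
    return kwargs

run_kwargs = resolve_standalone_mode({})
```

In a plain `python -m performance_test` invocation `sys.ps1` is unset, so `standalone_mode` stays `True` and Click handles exit codes itself; in a REPL it is disabled so the return value comes back to the caller instead.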
27 changes: 27 additions & 0 deletions performance-test/src/performance_test/hooks.py
@@ -0,0 +1,27 @@
from time import sleep

from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialises a SparkSession using the config
        defined in the project's conf folder.
        """

        # Load the Spark configuration in spark.yml using the config loader
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the Spark session
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        sleep(context.params['hook_delay'])
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
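This hunk only wires up `hook_delay`; the other two README parameters would presumably hook into Kedro's dataset I/O. A dependency-free sketch of that shape (the `DelayHooks` class and delay values below are assumptions, not code from this PR; in the real project the methods would carry `@hook_impl` and match Kedro's `before_dataset_loaded` / `before_dataset_saved` hook specs):

```python
from time import monotonic, sleep

class DelayHooks:
    """Sketch of dataset-delay hooks; the method names mirror Kedro's
    before_dataset_loaded / before_dataset_saved hook specs."""
    def __init__(self, dataset_load_delay=0.0, file_save_delay=0.0):
        self.dataset_load_delay = dataset_load_delay
        self.file_save_delay = file_save_delay

    def before_dataset_loaded(self, dataset_name):
        sleep(self.dataset_load_delay)   # simulate a slow load

    def before_dataset_saved(self, dataset_name):
        sleep(self.file_save_delay)      # simulate a slow save

hooks = DelayHooks(dataset_load_delay=0.02, file_save_delay=0.01)
start = monotonic()
hooks.before_dataset_loaded("congress_expenses")
hooks.before_dataset_saved("expenses_per_party")
elapsed = monotonic() - start
```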
16 changes: 16 additions & 0 deletions performance-test/src/performance_test/pipeline_registry.py
@@ -0,0 +1,16 @@
"""Project pipelines."""
from typing import Dict

from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline


def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines.

    Returns:
        A mapping from pipeline names to ``Pipeline`` objects.
    """
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines
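`sum(pipelines.values())` works because Kedro's `Pipeline` supports `+`; since `sum()` starts from the integer `0`, reflected addition has to accept that start value too. A toy stand-in showing the mechanics (`TinyPipeline` is illustrative, not Kedro's class):

```python
class TinyPipeline:
    """Minimal stand-in for a pipeline object that can be combined with sum()."""
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def __add__(self, other):
        return TinyPipeline(self.nodes + other.nodes)

    def __radd__(self, other):
        # sum() begins with 0 + pipeline, so accept the integer start value
        return self if other == 0 else NotImplemented

combined = sum([TinyPipeline(["clean"]), TinyPipeline(["aggregate", "report"])])
```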
Empty file.
@@ -0,0 +1,10 @@
"""
This is a boilerplate pipeline 'expense_analysis'
generated using Kedro 0.19.8
"""

from .pipeline import create_pipeline

__all__ = ["create_pipeline"]

__version__ = "0.1"