add multiprocessing #92

Open: wants to merge 99 commits into base: developer
Changes from 23 commits
Commits (99)
0e2242e
Add files for multiprocessing
Apr 18, 2024
2add9b9
Update identify_associations_multiprocess.py
LFT18 Apr 18, 2024
f645ea4
Clean multiprocessing script
LFT18 Apr 19, 2024
f471704
Update __main__.py multiprocessing
LFT18 Apr 19, 2024
85c28e5
Update schema.py multiprocessing
LFT18 Apr 19, 2024
6330e92
Update __init__.py multiprocessing
LFT18 Apr 19, 2024
bbe1b4e
Update preprocessing.py
LFT18 Apr 19, 2024
820c554
:fire: clean-up duplicated src/move files (pkg was in main folder)
Apr 22, 2024
ce9a9dc
:sparkles: add identify_associations_multiprocess to src/move/tasks
Apr 23, 2024
5327223
:bug: make mutliprocessing not stale: assign # of threads for each pr…
Apr 23, 2024
eaa858a
Merge pull request #1 from enryH/main
LFT18 Apr 23, 2024
5ab5e59
Updated identify_associations_multiprocess.py
Apr 23, 2024
33f565a
Update config files for small tries
Apr 23, 2024
63f128b
Multiprocessing for analyze_latent
Apr 24, 2024
ca389d2
Analyze latent multiprocessing
Apr 24, 2024
e08a94b
Analyze latent multiprocessing
Apr 24, 2024
e94ef90
Fix bayes_k calculation
Apr 25, 2024
f4f0aa3
Fix analyze_latent_multiprocessing
Apr 26, 2024
6a0b665
Update and new functions
May 11, 2024
a5310a6
Delete files and fix multiloop
May 11, 2024
e67bb75
Clean identify_association_multiprocess.py
May 21, 2024
86bfed5
Clean analyze_latent multiprocessing.py
May 21, 2024
f9d4961
Update perturbations.py
LFT18 Jun 7, 2024
c2c49e8
Update perturbations.py
LFT18 Jun 10, 2024
0a4bcae
Delete src/move/tasks/analyze_latent_efficient.py
LFT18 Jun 13, 2024
ce20dac
Delete src/move/tasks/analyze_latent_multiprocessing.py
LFT18 Jun 13, 2024
4a72842
Delete src/move/tasks/identify_associations_multiprocess_loop.py
LFT18 Jun 13, 2024
f537d21
Delete src/move/tasks/identify_associations_multiprocess_may.py
LFT18 Jun 13, 2024
568aaa8
Delete src/move/tasks/identify_associations_selected.py
LFT18 Jun 13, 2024
ede3707
Delete src/move/tasks/analyze_latent_original.py
LFT18 Jun 13, 2024
5df8a01
Remove multiprocess_loop
LFT18 Jun 13, 2024
52c37fd
Remove multiprocess_loop
LFT18 Jun 13, 2024
2e29f23
:art: format with black
Jun 18, 2024
13678cb
Merge branch 'main' into LFT18-main
Jun 18, 2024
f92a862
Merge branch 'developer' into LFT18-main
Jun 18, 2024
b7824f7
:art: add trigger of actions from PR
Jun 20, 2024
e5253a2
:art: format with black
Jun 20, 2024
d4118a3
:fire: remove duplicated code and intermediate scripts
Jun 20, 2024
3e19e24
Merge branch 'developer' into LFT18-main
Jun 21, 2024
dbc0238
:bug: fix f-string formatting errors
Jun 24, 2024
80352e6
:bug: remove unused imports
Jun 24, 2024
488b4a4
:rewind: add configuration files back in from developer branch
Jul 3, 2024
9b3a27e
:art: isort imports
Jul 3, 2024
05d4c34
:construction: see if this advances CI to the next step
Jul 3, 2024
ebb72ad
:fire: remove intermediate files of development
Jul 3, 2024
efbfd5c
:construction: multiprocess only defined for bayes factors
Jul 3, 2024
fbbeb19
:bug: remove non-existing, intermediate tasks (used for developing), …
Jul 3, 2024
cb3ad30
:bug: also deactivate mutliprocessing for KS as it's not implemented
Jul 3, 2024
8c61d35
:art: fix flake8-bugbear issues except missing multiprocessing of t-t…
Jul 3, 2024
5aa03ff
:bug: format and fix import
Jul 3, 2024
8b06298
:bug: use perturb_continuous_data_extended from perturbations
Jul 3, 2024
6cfd1f8
:fire: comments and old configurations; format
Jul 4, 2024
8277891
:fire: remove duplicated functionality
Jul 4, 2024
44802eb
:sparkles: integrate multiprocessing into analyze_latent.py
Jul 4, 2024
f64d779
:sparkles: merge multiprocessing bayes factors into identify_associat…
Jul 4, 2024
6a17110
:fire: remove old schema entries, increase run time
Jul 5, 2024
b8b4769
:zip: do no save intermediate files for single-process bayes_approach
Jul 5, 2024
2927d3f
:fire: remove comments
Jul 5, 2024
202eb74
:fire: remove unused code
Jul 5, 2024
d6bc896
:art: reorder functions
Jul 5, 2024
b99ce97
:construction: move bayes_parallel to own module
Jul 5, 2024
171c915
:construction: unify interface
Jul 5, 2024
95c9f40
:fire: remove not-used code
Jul 5, 2024
e556fae
:art: start separating recurrent code into fcts
Jul 5, 2024
1a2774d
:art: adapt to look more similar to single-core bayes factor fct
Jul 5, 2024
680c164
:sparkles: add back masking of self-perturbed feat.
Jul 5, 2024
8f43c90
:art: initiailize logger at the top of the module
Jul 8, 2024
6178c8f
:sparkles: pass feature_mask to bayes_parallell
Jul 8, 2024
60ed227
:bug: add condition for masking
Jul 8, 2024
f5da671
:art: align masking strategies
Jul 8, 2024
7eae82d
:bug: fix cont perturbation
Jul 8, 2024
50a623e
:bug: remove redefintion of nan_mask
Jul 8, 2024
2df057b
:art: only define logger once in module
Jul 8, 2024
2e87e12
:art: align single process bayes and multiprocess bayes fct
Jul 8, 2024
aa2e5d9
:art: just document in code that this cannot happen
Jul 8, 2024
4041fcb
:zap: improve CI speed, reduce stability (-> one refit only)
Jul 8, 2024
2d004e0
:bug: use default no. of epochs + t-test needs 4 refits
Jul 8, 2024
8d65528
:zap: do not run t-test check (for now)
Jul 9, 2024
1dd6788
:zap: bump up bayes factor training
Jul 9, 2024
dc9020e
:art: train both refits with 100 epochs
Jul 9, 2024
9cd2a7b
:sparkles: add log2 option
Jul 9, 2024
a4911d7
:art: document some more
Jul 9, 2024
e0421bd
:zap: test multiprocess on continuous tutorial
Jul 9, 2024
c70d328
:bug: remove non-exisitng key
Jul 9, 2024
1c72316
:sparkles: build dataloader fct
Jul 9, 2024
58f08e4
:bug: fix minor bug (wrongly assigned feat)
Jul 9, 2024
f895237
:zap: move masking code into main fct of module
Jul 9, 2024
5eb7954
:art: move feat_mask creation out
Jul 9, 2024
8c4e53b
:ambulance: temp. fix of CI
Jul 9, 2024
49a93d0
:zap: do not build dataloaders for multiprocessing
Jul 9, 2024
709c674
:construction: test t-test again, re-run pert. w/o model training
Jul 10, 2024
6e65cc6
:sparkles: add categorical pert. to multiprocessing
Jul 10, 2024
4efbdd9
:fire: remove unused code
Jul 10, 2024
dab767a
:art: remove unused argument
Jul 10, 2024
980bbce
:rewind: checkout developer version
Jul 12, 2024
c26b2dd
:art: move shared key to base class
Jul 12, 2024
05c1735
:fire: remove comments and code duplications
Jul 12, 2024
c5002cd
:art: update type hints, remove unused import
Jul 12, 2024
fe8c48b
Merge branch 'developer' into main
enryH Aug 12, 2024
6 changes: 6 additions & 0 deletions analysis/__init__.py
Member comment:

Reminder to me: Check folder and most likely delete entire folder as duplicate.

Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
__all__ = ["calculate_accuracy", "calculate_cosine_similarity"]

from move.analysis.metrics import (
    calculate_accuracy,
    calculate_cosine_similarity,
)
99 changes: 99 additions & 0 deletions analysis/metrics.py
@@ -0,0 +1,99 @@
__all__ = ["calculate_accuracy", "calculate_cosine_similarity"]

import numpy as np

from move.core.typing import FloatArray


def calculate_accuracy(
    original_input: FloatArray, reconstruction: FloatArray
) -> FloatArray:
    """Compute accuracy per sample.

    Args:
        original_input: Original labels (one-hot encoded as a 3D array).
        reconstruction: Reconstructed labels (2D array).

    Returns:
        Array of accuracy scores.
    """
    if original_input.ndim != 3:
        raise ValueError("Expected original input to have three dimensions.")
    if reconstruction.ndim != 2:
        raise ValueError("Expected reconstruction to have two dimensions.")
    if original_input[:, :, 0].shape != reconstruction.shape:
        raise ValueError(
            f"Original input {original_input.shape} and reconstruction "
            f"{reconstruction.shape} shapes do not match."
        )

    is_nan = original_input.sum(axis=2) == 0
    original_input = np.argmax(original_input, axis=2)  # 3D => 2D
    y_true = np.ma.masked_array(original_input, mask=is_nan)
    y_pred = np.ma.masked_array(reconstruction, mask=is_nan)

    num_features = np.ma.count(y_true, axis=1)
    scores = np.sum(y_true == y_pred, axis=1) / num_features

    return np.ma.filled(scores, 0)


def calculate_cosine_similarity(
    original_input: FloatArray, reconstruction: FloatArray
) -> FloatArray:
    """Compute cosine similarity per sample.

    Args:
        original_input: Original values (2D array).
        reconstruction: Reconstructed values (2D array).

    Returns:
        Array of similarities.
    """
    if any((original_input.ndim != 2, reconstruction.ndim != 2)):
        raise ValueError("Expected both inputs to have two dimensions.")
    if original_input.shape != reconstruction.shape:
        raise ValueError(
            f"Original input {original_input.shape} and reconstruction "
            f"{reconstruction.shape} shapes do not match."
        )

    is_nan = original_input == 0
    x = np.ma.masked_array(original_input, mask=is_nan)
    y = np.ma.masked_array(reconstruction, mask=is_nan)

    # Equivalent to `np.diag(sklearn.metrics.pairwise.cosine_similarity(x, y))`
    # But can handle masked arrays
    scores = np.sum(x * y, axis=1) / (norm(x) * norm(y))

    return np.ma.filled(scores, 0)


def norm(x: np.ma.MaskedArray, axis: int = 1) -> np.ma.MaskedArray:
    """Return Euclidean norm. This function is equivalent to `np.linalg.norm`,
    but it can handle masked arrays.

    Args:
        x: 2D masked array
        axis: Axis along which the operation is performed. Defaults to 1.

    Returns:
        1D array with the specified axis removed.
    """
    return np.sqrt(np.sum(x**2, axis=axis))


def get_2nd_order_polynomial(x_array, y_array, n_points=100):
    """
    Given a set of x and y values, find the 2nd-order polynomial that best fits the data.

    Returns:
        x_pol: x coordinates for the polynomial function evaluation.
        y_pol: y coordinates for the polynomial function evaluation.
        (a2, a1, a): fitted coefficients, from highest to lowest degree.
    """
    a2, a1, a = np.polyfit(x_array, y_array, deg=2)

    x_pol = np.linspace(np.min(x_array), np.max(x_array), n_points)
    y_pol = np.array([a2 * x * x + a1 * x + a for x in x_pol])

    return x_pol, y_pol, (a2, a1, a)
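Note on the masking in analysis/metrics.py: in `calculate_accuracy`, a categorical entry whose one-hot vector is all zeros is treated as missing and excluded from both the numerator and the denominator. The following is a minimal, self-contained sketch of that idea on a toy array. `masked_accuracy` is a condensed restatement written for illustration (a hypothetical name, without the shape checks of the function in this diff), not the package's own function.

```python
import numpy as np


def masked_accuracy(one_hot: np.ndarray, predicted: np.ndarray) -> np.ndarray:
    """Per-sample accuracy that ignores entries whose one-hot vector is all zeros."""
    missing = one_hot.sum(axis=2) == 0  # True where no class is set
    y_true = np.ma.masked_array(np.argmax(one_hot, axis=2), mask=missing)
    y_pred = np.ma.masked_array(predicted, mask=missing)
    n_observed = np.ma.count(y_true, axis=1)  # non-missing entries per sample
    scores = np.sum(y_true == y_pred, axis=1) / n_observed
    return np.ma.filled(scores, 0)


# Two samples, three categorical features, two classes each (toy data).
one_hot = np.array(
    [
        [[1, 0], [0, 1], [0, 0]],  # labels: 0, 1, missing
        [[0, 1], [0, 1], [1, 0]],  # labels: 1, 1, 0
    ]
)
predicted = np.array([[0, 1, 1], [1, 0, 0]])

scores = masked_accuracy(one_hot, predicted)
print(scores)  # sample 0: 2 of 2 observed correct; sample 1: 2 of 3 correct
```

The missing entry in the first sample does not count as an error, so that sample scores 1.0 even though the prediction at the masked position is arbitrary.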
25 changes: 25 additions & 0 deletions isoforms_first_try/config/data/isoforms.yaml
Member comment:

I will create an extra branch with your analysis scripts/configs - and then merge the core functionality to the developer branch.

@@ -0,0 +1,25 @@
defaults:
- base_data

raw_data_path: data/
interim_data_path: interim_data/
results_path: results/

sample_names: common_samples_500

categorical_inputs:
- name: OS
- name: DSS
- name: PFI
- name: DFI
- name: ajcc_pathologic_tumor_stage
- name: gender
- name: cancer_type_abbreviation
- name: histological_type
- name: vital_status

continuous_inputs:
- name: pheno_continuous_float
- name: iso_IF_100
- name: iso_tpm_100
- name: gene_tpm_100
24 changes: 24 additions & 0 deletions isoforms_first_try/config/task/isoforms__id_assoc_bayes.yaml
@@ -0,0 +1,24 @@
defaults:
- identify_associations_bayes

multiprocess: True

batch_size: 250

num_refits: 2

target_dataset: iso_tpm_100 # Dataset to perturb
target_value: maximum # Set the feature to its maximum value (taken across all samples) in every sample
save_refits: True

model:
  num_hidden:
    - 150
  num_latent: 10
  beta: 0.001
  dropout: .1
  cuda: false

training_loop:
  lr: 1e-4
  num_epochs: 20
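Note on `target_value: maximum`: the config comment describes the perturbation as setting a feature to its maximum across all samples, in every sample. Below is a minimal NumPy sketch of that idea, assuming a 2D samples x features array. `perturb_to_maximum` is a hypothetical helper written for illustration; it is not the package's perturbation code (the commit list refers to `perturb_continuous_data_extended`, whose signature is not shown on this page).

```python
import numpy as np


def perturb_to_maximum(data: np.ndarray, feature_idx: int) -> np.ndarray:
    """Return a copy of `data` in which one feature equals its maximum in all samples."""
    if data.ndim != 2:
        raise ValueError("Expected a 2D array (samples x features).")
    perturbed = data.copy()  # leave the baseline untouched
    # nanmax ignores missing values when looking up the column maximum
    perturbed[:, feature_idx] = np.nanmax(data[:, feature_idx])
    return perturbed


baseline = np.array([[0.1, 2.0], [0.5, -1.0], [0.3, 0.0]])
perturbed = perturb_to_maximum(baseline, feature_idx=0)
print(perturbed[:, 0])  # every sample now holds the maximum of column 0
```

Comparing model reconstructions of `baseline` and `perturbed` is what an association test would then build on. Working on a copy keeps the baseline available for that comparison.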
7 changes: 7 additions & 0 deletions isoforms_first_try/encode_data_iso.py
@@ -0,0 +1,7 @@
from move.tasks import encode_data
from move.data import io
from move.data import preprocessing

config = io.read_config("isoforms", "encode_data")
encode_data(config.data)

23 changes: 23 additions & 0 deletions isoforms_first_try/encode_data_iso.sh
@@ -0,0 +1,23 @@
#!/bin/bash

# The following are commonly used options for running jobs. Remove one
# "#" from the "##SBATCH" lines (changing them to "#SBATCH") to enable
# a given option.

#SBATCH --job-name=encode500
# The number of CPUs (cores) used by your task. Defaults to 1.
###SBATCH --cpus-per-task=10
# The amount of RAM used by your task. Tasks are automatically assigned 15G
# per CPU (set above) if this option is not set.
#SBATCH --mem=800G
# Request a GPU on the GPU node. Use `--gres=gpu:a100:2` to request both GPUs.
##SBATCH --partition=gpuqueue --gres=gpu:a100:1
# Send notifications when job ends. Remember to update the email address!
##SBATCH [email protected] --mail-type=END,FAIL
# Set an error file
#SBATCH --error=encode_iso.err
#SBATCH --out=encode_iso.log


module load python/3.11.3
python encode_data_iso.py
14 changes: 14 additions & 0 deletions isoforms_first_try/encode_iso.err
@@ -0,0 +1,14 @@
[INFO - encode_data]: Beginning task: encode data
[INFO - encode_data]: Encoding 'OS'
[INFO - encode_data]: Encoding 'DSS'
[INFO - encode_data]: Encoding 'PFI'
[INFO - encode_data]: Encoding 'DFI'
[INFO - encode_data]: Encoding 'ajcc_pathologic_tumor_stage'
[INFO - encode_data]: Encoding 'gender'
[INFO - encode_data]: Encoding 'cancer_type_abbreviation'
[INFO - encode_data]: Encoding 'histological_type'
[INFO - encode_data]: Encoding 'vital_status'
[INFO - encode_data]: Encoding 'pheno_continuous_float'
[INFO - encode_data]: Encoding 'iso_IF_2000_data'
[INFO - encode_data]: Encoding 'iso_tpm_2000_data'
[INFO - encode_data]: Encoding 'gene_tpm_2000_data'
6 changes: 6 additions & 0 deletions isoforms_first_try/identify_assoc_bayes_iso_multiprocess.py
@@ -0,0 +1,6 @@
from move.tasks import identify_associations_multiprocess
from move.data import io

config = io.read_config("isoforms", "isoforms__id_assoc_bayes")
identify_associations_multiprocess(config)

23 changes: 23 additions & 0 deletions isoforms_first_try/identify_assoc_bayes_iso_multiprocess.sh
@@ -0,0 +1,23 @@
#!/bin/bash

# The following are commonly used options for running jobs. Remove one
# "#" from the "##SBATCH" lines (changing them to "#SBATCH") to enable
# a given option.

#SBATCH --job-name=500multi

# The number of CPUs (cores) used by your task. Defaults to 1.
#SBATCH --cpus-per-task=50
# The amount of RAM used by your task. Tasks are automatically assigned 15G
# per CPU (set above) if this option is not set.
#SBATCH --mem=1000G
# Request a GPU on the GPU node. Use `--gres=gpu:a100:2` to request both GPUs.
##SBATCH --partition=gpuqueue --gres=gpu:a100:1
# Send notifications when job ends. Remember to update the email address!
##SBATCH [email protected] --mail-type=END,FAIL
# Set an error file
#SBATCH --error=multi_bayes_iso.err


module load python/3.11.3
python identify_assoc_bayes_iso_multiprocess.py
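Note on sizing the worker pool: the job script above requests 50 CPUs with `--cpus-per-task=50`, and the commit list mentions assigning a number of threads to each process. The following is a minimal sketch of how a process count and a per-process thread count might be derived from that allocation. `plan_workers` is a hypothetical helper, not part of this PR, and it assumes one independent task per perturbed feature. `SLURM_CPUS_PER_TASK` is the environment variable Slurm exports when `--cpus-per-task` is set.

```python
import os
from typing import Optional, Tuple


def plan_workers(n_tasks: int, cpus: Optional[int] = None) -> Tuple[int, int]:
    """Pick (n_processes, threads_per_process) so their product stays within the CPU budget."""
    if cpus is None:
        # Fall back to the local CPU count when not running under Slurm
        cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", os.cpu_count() or 1))
    if n_tasks < 1 or cpus < 1:
        raise ValueError("n_tasks and cpus must both be at least 1.")
    n_processes = min(n_tasks, cpus)  # never start more processes than tasks
    threads_per_process = max(1, cpus // n_processes)
    return n_processes, threads_per_process


print(plan_workers(100, cpus=50))  # (50, 1)
print(plan_workers(4, cpus=50))  # (4, 12)
```

Each worker could then cap its own thread pool (for example with `torch.set_num_threads`) so that processes times threads does not exceed the allocation. Oversubscription of this kind is a common reason for parallel jobs that appear to hang.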
6 changes: 6 additions & 0 deletions isoforms_first_try/identify_assoc_bayes_iso_notmulti.py
@@ -0,0 +1,6 @@
from move.tasks import identify_associations
from move.data import io

config = io.read_config("isoforms", "isoforms__id_assoc_bayes")
identify_associations(config)

23 changes: 23 additions & 0 deletions isoforms_first_try/identify_assoc_bayes_normal.sh
@@ -0,0 +1,23 @@
#!/bin/bash

# The following are commonly used options for running jobs. Remove one
# "#" from the "##SBATCH" lines (changing them to "#SBATCH") to enable
# a given option.

#SBATCH --job-name=500notmulti_bayes

# The number of CPUs (cores) used by your task. Defaults to 1.
###SBATCH --cpus-per-task=1
# The amount of RAM used by your task. Tasks are automatically assigned 15G
# per CPU (set above) if this option is not set.
#SBATCH --mem=1000G
# Request a GPU on the GPU node. Use `--gres=gpu:a100:2` to request both GPUs.
##SBATCH --partition=gpuqueue --gres=gpu:a100:1
# Send notifications when job ends. Remember to update the email address!
###SBATCH [email protected] --mail-type=END,FAIL
# Set an error file
#SBATCH --error=normal_bayes.err


module load python/3.11.3
python identify_assoc_bayes_iso_notmulti.py
1 change: 1 addition & 0 deletions isoforms_first_try/interim_data/DFI.txt
@@ -0,0 +1 @@
DFI
1 change: 1 addition & 0 deletions isoforms_first_try/interim_data/DSS.txt
@@ -0,0 +1 @@
DSS
1 change: 1 addition & 0 deletions isoforms_first_try/interim_data/OS.txt
@@ -0,0 +1 @@
OS
1 change: 1 addition & 0 deletions isoforms_first_try/interim_data/PFI.txt
@@ -0,0 +1 @@
PFI
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
ajcc_pathologic_tumor_stage
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
cancer type abbreviation
1 change: 1 addition & 0 deletions isoforms_first_try/interim_data/gender.txt
@@ -0,0 +1 @@
gender
99 changes: 99 additions & 0 deletions isoforms_first_try/interim_data/gene_tpm_100.txt
@@ -0,0 +1,99 @@
ENSG00000005302.17
ENSG00000006327.13
ENSG00000006432.15
ENSG00000017260.19
ENSG00000018699.11
ENSG00000026508.16
ENSG00000047249.16
ENSG00000051180.16
ENSG00000052841.14
ENSG00000060069.16
ENSG00000064666.14
ENSG00000065457.10
ENSG00000066044.13
ENSG00000066136.19
ENSG00000066468.20
ENSG00000069345.11
ENSG00000073605.18
ENSG00000074054.17
ENSG00000075415.12
ENSG00000075702.16
ENSG00000078747.12
ENSG00000080845.17
ENSG00000084636.17
ENSG00000090266.12
ENSG00000099899.14
ENSG00000099917.17
ENSG00000100129.17
ENSG00000100142.14
ENSG00000100599.15
ENSG00000100731.15
ENSG00000102226.9
ENSG00000104517.12
ENSG00000104613.11
ENSG00000104814.12
ENSG00000104824.16
ENSG00000105245.9
ENSG00000105323.16
ENSG00000106392.10
ENSG00000106665.15
ENSG00000110700.6
ENSG00000110719.9
ENSG00000117151.12
ENSG00000117394.19
ENSG00000117640.17
ENSG00000122482.20
ENSG00000122484.8
ENSG00000123352.17
ENSG00000124444.15
ENSG00000125122.14
ENSG00000125779.21
ENSG00000125962.14
ENSG00000125967.16
ENSG00000125991.18
ENSG00000130305.16
ENSG00000132199.18
ENSG00000132300.18
ENSG00000132600.16
ENSG00000133641.17
ENSG00000134013.15
ENSG00000135127.11
ENSG00000135801.9
ENSG00000135845.9
ENSG00000137817.16
ENSG00000137821.11
ENSG00000138399.17
ENSG00000139620.12
ENSG00000140598.13
ENSG00000140650.11
ENSG00000141252.19
ENSG00000143554.13
ENSG00000149016.15
ENSG00000149218.4
ENSG00000152240.12
ENSG00000153147.5
ENSG00000156261.12
ENSG00000156508.17
ENSG00000157106.16
ENSG00000157978.11
ENSG00000158863.21
ENSG00000159111.12
ENSG00000160298.17
ENSG00000160570.13
ENSG00000160691.18
ENSG00000161513.11
ENSG00000163110.14
ENSG00000163521.15
ENSG00000165526.8
ENSG00000165637.13
ENSG00000166797.10
ENSG00000166949.15
ENSG00000166971.16
ENSG00000168575.9
ENSG00000168646.12
ENSG00000169249.12
ENSG00000169925.16
ENSG00000169967.16
ENSG00000170142.11
ENSG00000170310.14
ENSG00000171617.13