Merge branch 'main' into issue_549_disconnected_schema
lajohn4747 committed Jun 5, 2024
2 parents d04dff2 + 29cf341 commit 3871c2a
Showing 34 changed files with 935 additions and 296 deletions.
34 changes: 34 additions & 0 deletions .github/workflows/install.yaml
@@ -0,0 +1,34 @@
name: Install Tests
on:
pull_request:
types: [opened, synchronize]
push:
branches:
- main
jobs:
install:
name: ${{ matrix.python_version }} install
strategy:
fail-fast: true
matrix:
python_version: ["3.8", "3.12"]
runs-on: ubuntu-latest
steps:
- name: Set up python ${{ matrix.python_version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python_version }}
- uses: actions/checkout@v4
- name: Build package
run: |
make package
- name: Install package
run: |
python -m pip install "unpacked_sdist/."
- name: Test by importing packages
run: |
python -c "import sdv"
python -c "import sdv;print(sdv.version.public)"
- name: Check package conflicts
run: |
python -m pip check
1 change: 1 addition & 0 deletions .github/workflows/integration.yml
@@ -9,6 +9,7 @@ jobs:
integration:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: true
matrix:
python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
os: [ubuntu-latest, windows-latest]
1 change: 1 addition & 0 deletions .github/workflows/minimum.yml
@@ -9,6 +9,7 @@ jobs:
minimum:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: true
matrix:
python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
os: [ubuntu-latest, windows-latest]
1 change: 1 addition & 0 deletions .github/workflows/unit.yml
@@ -9,6 +9,7 @@ jobs:
unit:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: true
matrix:
python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
os: [ubuntu-latest, windows-latest]
40 changes: 40 additions & 0 deletions HISTORY.md
@@ -1,5 +1,45 @@
# Release Notes

## 1.13.1 - 2024-05-16

This release fixes the `ModuleNotFoundError` error that was causing the 1.13.0 release to fail.

## 1.13.0 - 2024-05-15

This release adds a utility function called `get_random_subset` that helps users get a subset of their multi-table data so that modeling can be done more quickly. Given a dictionary of table names mapped to DataFrames, the metadata, a main table and a desired number of rows for the main table, it subsamples the data in a way that maintains referential integrity.
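A sketch of the referential-integrity idea behind `get_random_subset`, in plain Python. The function name, the two-table layout, and the `id` key below are illustrative, not SDV's actual API:

```python
import random

def subsample_with_integrity(parent_rows, child_rows, foreign_key, num_parent_rows, seed=0):
    """Randomly keep `num_parent_rows` parent rows, then keep only the
    child rows whose foreign key still points at a surviving parent."""
    rng = random.Random(seed)
    kept_parents = rng.sample(parent_rows, num_parent_rows)
    kept_ids = {row['id'] for row in kept_parents}
    kept_children = [row for row in child_rows if row[foreign_key] in kept_ids]
    return kept_parents, kept_children

# Hypothetical two-table dataset: guests and their bookings.
guests = [{'id': i} for i in range(10)]
bookings = [{'booking_id': b, 'guest_id': b % 10} for b in range(30)]
sampled_guests, sampled_bookings = subsample_with_integrity(
    guests, bookings, foreign_key='guest_id', num_parent_rows=3)
```

Every surviving booking still references a surviving guest, which is the property the release notes call referential integrity.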

This release also adds two new local file handlers: the `CSVHandler` and the `ExcelHandler`. These enable users to easily load synthetic data from, and save it to, these file types. The handlers return data and metadata in the multi-table format, so we also added the function `get_table_metadata` to get a `SingleTableMetadata` object from a `MultiTableMetadata` object.
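A minimal stdlib sketch of what a local CSV handler does: read every `*.csv` in a folder into a dictionary keyed by table name, the multi-table convention. `read_csv_folder` is an illustrative name, not the SDV API:

```python
import csv
from pathlib import Path

def read_csv_folder(folder):
    """Load each CSV file in `folder` as a list of row dicts,
    keyed by the file's stem (the table name)."""
    data = {}
    for path in Path(folder).glob('*.csv'):
        with open(path, newline='') as f:
            data[path.stem] = list(csv.DictReader(f))
    return data
```

The real `CSVHandler` additionally detects metadata; this sketch covers only the load-into-a-dict shape of the result.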

Finally, this release fixes some bugs that prevented synthesizers from working with data that had numerical column names.

### New Features

* Add `get_random_subset` poc utility function - Issue [#1877](https://github.com/sdv-dev/SDV/issues/1877) by @R-Palazzo
* Add usage logging - Issue [#1903](https://github.com/sdv-dev/SDV/issues/1903) by @pvk-developer
* Move function `drop_unknown_references` from `poc` to be directly under `utils` - Issue [#1947](https://github.com/sdv-dev/SDV/issues/1947) by @R-Palazzo
* Add CSVHandler - Issue [#1949](https://github.com/sdv-dev/SDV/issues/1949) by @pvk-developer
* Add ExcelHandler - Issue [#1950](https://github.com/sdv-dev/SDV/issues/1950) by @pvk-developer
* Add get_table_metadata function - Issue [#1951](https://github.com/sdv-dev/SDV/issues/1951) by @R-Palazzo
* Save usage log file as a csv - Issue [#1974](https://github.com/sdv-dev/SDV/issues/1974) by @frances-h
* Split out metadata creation from data import in the local files handlers - Issue [#1975](https://github.com/sdv-dev/SDV/issues/1975) by @pvk-developer
* Improve error message when trying to sample before fitting (single table) - Issue [#1978](https://github.com/sdv-dev/SDV/issues/1978) by @R-Palazzo

### Bugs Fixed

* Metadata detection crashes when the column names are integers (`AttributeError: 'int' object has no attribute 'lower'`) - Issue [#1933](https://github.com/sdv-dev/SDV/issues/1933) by @lajohn4747
* Synthesizers crash when column names are integers (`TypeError: unsupported operand`) - Issue [#1935](https://github.com/sdv-dev/SDV/issues/1935) by @lajohn4747
* Switch parameter order in drop_unknown_references - Issue [#1944](https://github.com/sdv-dev/SDV/issues/1944) by @R-Palazzo
* Unexpected NaN values in sequence_index when dataframe isn't reset - Issue [#1973](https://github.com/sdv-dev/SDV/issues/1973) by @fealho
* Fix pandas DtypeWarning in download_demo - Issue [#1980](https://github.com/sdv-dev/SDV/issues/1980) by @fealho

### Maintenance

* Only run unit and integration tests on oldest and latest python versions for macos - Issue [#1948](https://github.com/sdv-dev/SDV/issues/1948) by @frances-h

### Internal

* Update code to remove `FutureWarning` related to 'enforce_uniqueness' parameter - Issue [#1995](https://github.com/sdv-dev/SDV/issues/1995) by @pvk-developer

## 1.12.1 - 2024-04-19

This release makes a number of changes to how ID columns are generated. By default, ID columns with a regex will now have their values scrambled in the output. Numeric ID columns without a regex will be created randomly; non-numeric ones will receive a random suffix.
25 changes: 24 additions & 1 deletion Makefile
@@ -235,6 +235,10 @@ ifeq ($(CHANGELOG_LINES),0)
$(error Please insert the release notes in HISTORY.md before releasing)
endif

.PHONY: git-push
git-push: ## Simply push the repository to GitHub
git push

.PHONY: check-release
check-release: check-clean check-main check-history ## Check if the release can be made
@echo "A new release can be made"
@@ -261,5 +265,24 @@ release-major: check-release bumpversion-major release

.PHONY: check-deps
check-deps:
$(eval allow_list='cloudpickle=|graphviz=|numpy=|pandas=|tqdm=|copulas=|ctgan=|deepecho=|rdt=|sdmetrics=|platformdirs=')
$(eval allow_list='cloudpickle=|graphviz=|numpy=|pandas=|tqdm=|copulas=|ctgan=|deepecho=|rdt=|sdmetrics=|platformdirs=|pyyaml=')
pip freeze | grep -v "SDV.git" | grep -E $(allow_list) | sort > $(OUTPUT_FILEPATH)

.PHONY: upgradepip
upgradepip:
python -m pip install --upgrade pip

.PHONY: upgradebuild
upgradebuild:
python -m pip install --upgrade build

.PHONY: upgradesetuptools
upgradesetuptools:
python -m pip install --upgrade setuptools

.PHONY: package
package: upgradepip upgradebuild upgradesetuptools
python -m build ; \
$(eval VERSION=$(shell python -c 'import setuptools; setuptools.setup()' --version))
tar -zxvf "dist/sdv-${VERSION}.tar.gz"
mv "sdv-${VERSION}" unpacked_sdist
20 changes: 12 additions & 8 deletions README.md
@@ -94,12 +94,12 @@ column and the primary key (`guest_email`).
## Synthesizing Data
Next, we can create an **SDV synthesizer**, an object that you can use to create synthetic data.
It learns patterns from the real data and replicates them to generate synthetic data. Let's use
the `FAST_ML` preset synthesizer, which is optimized for performance.
the [GaussianCopulaSynthesizer](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/gaussiancopulasynthesizer).

```python
from sdv.lite import SingleTablePreset
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = SingleTablePreset(metadata, name='FAST_ML')
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)
```

@@ -131,11 +131,15 @@ quality_report = evaluate_quality(
```

```
Creating report: 100%|██████████| 4/4 [00:00<00:00, 19.30it/s]
Overall Quality Score: 89.12%
Properties:
Column Shapes: 90.27%
Column Pair Trends: 87.97%
Generating report ...
(1/2) Evaluating Column Shapes: |████████████████| 9/9 [00:00<00:00, 1133.09it/s]|
Column Shapes Score: 89.11%
(2/2) Evaluating Column Pair Trends: |██████████████████████████████████████████| 36/36 [00:00<00:00, 502.88it/s]|
Column Pair Trends Score: 88.3%
Overall Score (Average): 88.7%
```
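As a sanity check, the "Overall Score (Average)" line in the new output is just the mean of the two property scores printed above it:

```python
# The report's overall score is the average of its property scores.
column_shapes = 89.11
column_pair_trends = 88.3

overall = (column_shapes + column_pair_trends) / 2
print(round(overall, 1))  # 88.7, matching the report output
```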

This object computes an overall quality score on a scale of 0 to 100% (100 being the best) as well
8 changes: 4 additions & 4 deletions latest_requirements.txt
@@ -1,11 +1,11 @@
cloudpickle==3.0.0
copulas==0.11.0
ctgan==0.10.0
ctgan==0.10.1
deepecho==0.6.0
graphviz==0.20.3
numpy==1.26.4
pandas==2.2.2
platformdirs==4.2.1
rdt==1.12.0
sdmetrics==0.14.0
platformdirs==4.2.2
rdt==1.12.1
sdmetrics==0.14.1
tqdm==4.66.4
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -38,6 +38,7 @@ dependencies = [
'rdt>=1.12.0',
'sdmetrics>=0.14.0',
'platformdirs>=4.0',
'pyyaml>=6.0.1',
]

[project.urls]
@@ -157,7 +158,7 @@ namespaces = false
version = {attr = 'sdv.__version__'}

[tool.bumpversion]
current_version = "1.12.2.dev0"
current_version = "1.13.2.dev0"
parse = '(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<release>[a-z]+)(?P<candidate>\d+))?'
serialize = [
'{major}.{minor}.{patch}.{release}{candidate}',
2 changes: 1 addition & 1 deletion sdv/__init__.py
@@ -6,7 +6,7 @@

__author__ = 'DataCebo, Inc.'
__email__ = '[email protected]'
__version__ = '1.12.2.dev0'
__version__ = '1.13.2.dev0'


import sys
2 changes: 1 addition & 1 deletion sdv/datasets/demo.py
@@ -96,7 +96,7 @@ def _get_data(modality, output_folder_name, in_memory_directory):
for filename, file_ in in_memory_directory.items():
if filename.endswith('.csv'):
table_name = Path(filename).stem
data[table_name] = pd.read_csv(io.StringIO(file_.decode()))
data[table_name] = pd.read_csv(io.StringIO(file_.decode()), low_memory=False)

if modality != 'multi_table':
data = data.popitem()[1]
4 changes: 3 additions & 1 deletion sdv/logging/__init__.py
@@ -1,10 +1,12 @@
"""Module for configuring loggers within the SDV library."""

from sdv.logging.logger import get_sdv_logger
from sdv.logging.utils import disable_single_table_logger, get_sdv_logger_config
from sdv.logging.utils import (
disable_single_table_logger, get_sdv_logger_config, load_logfile_dataframe)

__all__ = (
'disable_single_table_logger',
'get_sdv_logger',
'get_sdv_logger_config',
'load_logfile_dataframe'
)
31 changes: 29 additions & 2 deletions sdv/logging/logger.py
@@ -1,11 +1,35 @@
"""SDV Logger."""

import csv
import logging
from functools import lru_cache
from io import StringIO

from sdv.logging.utils import get_sdv_logger_config


class CSVFormatter(logging.Formatter):
"""Logging formatter to convert to CSV."""

def __init__(self):
super().__init__()
self.output = StringIO()
headers = [
'LEVEL', 'EVENT', 'TIMESTAMP', 'SYNTHESIZER CLASS NAME', 'SYNTHESIZER ID',
'TOTAL NUMBER OF TABLES', 'TOTAL NUMBER OF ROWS', 'TOTAL NUMBER OF COLUMNS'
]
self.writer = csv.DictWriter(self.output, headers)

def format(self, record): # noqa: A003
"""Format the record and write to CSV."""
row = record.msg
row['LEVEL'] = record.levelname
self.writer.writerow(row)
data = self.output.getvalue()
self.output.truncate(0)
self.output.seek(0)
return data.strip()
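A self-contained sketch of how this formatter behaves once attached to a handler. The class is reproduced with a trimmed header list so the snippet runs on its own; the logger name and log payload are illustrative:

```python
import csv
import logging
from io import StringIO

class CSVFormatter(logging.Formatter):
    """Same idea as the class above: render each log record's
    dict payload as a single CSV row."""
    def __init__(self):
        super().__init__()
        self.output = StringIO()
        headers = ['LEVEL', 'EVENT', 'TIMESTAMP']  # trimmed for the sketch
        self.writer = csv.DictWriter(self.output, headers)

    def format(self, record):
        row = dict(record.msg)          # the record's msg is a dict
        row['LEVEL'] = record.levelname
        self.writer.writerow(row)
        data = self.output.getvalue()
        self.output.truncate(0)         # reset the buffer for the next record
        self.output.seek(0)
        return data.strip()

stream = StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(CSVFormatter())
logger = logging.getLogger('csv_demo')
logger.propagate = False
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info({'EVENT': 'Fit', 'TIMESTAMP': '2024-06-05 00:00:00'})
line = stream.getvalue().strip()  # 'INFO,Fit,2024-06-05 00:00:00'
```

Note the buffer reset after each record: the formatter reuses one `StringIO` so each `format` call returns exactly one CSV row.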


@lru_cache()
def get_sdv_logger(logger_name):
"""Get a logger instance with the specified name and configuration.
@@ -38,7 +62,10 @@ def get_sdv_logger(logger_name):
formatter = None
config = logger_conf.get('loggers').get(logger_name)
log_level = getattr(logging, config.get('level', 'INFO'))
if config.get('format'):
if config.get('formatter'):
if config.get('formatter') == 'sdv.logging.logger.CSVFormatter':
formatter = CSVFormatter()
elif config.get('format'):
formatter = logging.Formatter(config.get('format'))

logger.setLevel(log_level)
12 changes: 8 additions & 4 deletions sdv/logging/sdv_logger_config.yml
@@ -4,24 +4,28 @@ loggers:
SingleTableSynthesizer:
level: INFO
propagate: false
formatter: sdv.logging.logger.CSVFormatter
handlers:
class: logging.FileHandler
filename: sdv_logs.log
filename: sdv_logs.csv
MultiTableSynthesizer:
level: INFO
propagate: false
formatter: sdv.logging.logger.CSVFormatter
handlers:
class: logging.FileHandler
filename: sdv_logs.log
filename: sdv_logs.csv
MultiTableMetadata:
level: INFO
propagate: false
formatter: sdv.logging.logger.CSVFormatter
handlers:
class: logging.FileHandler
filename: sdv_logs.log
filename: sdv_logs.csv
SingleTableMetadata:
level: INFO
propagate: false
formatter: sdv.logging.logger.CSVFormatter
handlers:
class: logging.FileHandler
filename: sdv_logs.log
filename: sdv_logs.csv
17 changes: 16 additions & 1 deletion sdv/logging/utils.py
@@ -5,6 +5,7 @@
import shutil
from pathlib import Path

import pandas as pd
import platformdirs
import yaml

@@ -25,7 +26,7 @@ def get_sdv_logger_config():

for logger in logger_conf.get('loggers', {}).values():
handler = logger.get('handlers', {})
if handler.get('filename') == 'sdv_logs.log':
if handler.get('filename') == 'sdv_logs.csv':
handler['filename'] = store_path / handler['filename']

return logger_conf
@@ -49,3 +50,17 @@ def disable_single_table_logger():
finally:
for handler in handlers:
single_table_logger.addHandler(handler)
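The try/finally shape visible above — detach the handlers, then always restore them — is the standard way to temporarily silence a logger. A generic sketch of that pattern (the `silenced` helper is illustrative, not part of SDV):

```python
import logging
from contextlib import contextmanager

@contextmanager
def silenced(logger):
    """Temporarily detach a logger's handlers, restoring them on exit
    even if the body raises (the same try/finally shape as above)."""
    handlers = list(logger.handlers)
    for handler in handlers:
        logger.removeHandler(handler)
    try:
        yield
    finally:
        for handler in handlers:
            logger.addHandler(handler)

demo = logging.getLogger('silence_demo')
demo.addHandler(logging.NullHandler())
with silenced(demo):
    inside = len(demo.handlers)   # 0 while silenced
after = len(demo.handlers)        # handler restored
```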


def load_logfile_dataframe(logfile):
"""Load the SDV logfile as a pandas DataFrame with correct column headers.
Args:
logfile (str):
Path to the SDV log CSV file.
"""
column_names = [
'LEVEL', 'EVENT', 'TIMESTAMP', 'SYNTHESIZER CLASS NAME', 'SYNTHESIZER ID',
'TOTAL NUMBER OF TABLES', 'TOTAL NUMBER OF ROWS', 'TOTAL NUMBER OF COLUMNS'
]
return pd.read_csv(logfile, names=column_names)
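Since the log file has no header row, the explicit `names=` list is what gives the columns their labels. A stdlib illustration of the same pairing (the log line below is made up):

```python
import csv
from io import StringIO

COLUMN_NAMES = [
    'LEVEL', 'EVENT', 'TIMESTAMP', 'SYNTHESIZER CLASS NAME', 'SYNTHESIZER ID',
    'TOTAL NUMBER OF TABLES', 'TOTAL NUMBER OF ROWS', 'TOTAL NUMBER OF COLUMNS'
]

# One illustrative headerless log row, as the CSVFormatter would emit it.
logfile = StringIO(
    'INFO,Fit,2024-06-05 00:00:00,GaussianCopulaSynthesizer,sy-1,1,500,9\n')
rows = [dict(zip(COLUMN_NAMES, row)) for row in csv.reader(logfile)]
```

`pd.read_csv(logfile, names=column_names)` does the same pairing, but returns a DataFrame instead of a list of dicts.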
6 changes: 6 additions & 0 deletions sdv/metadata/multi_table.py
Original file line number Diff line number Diff line change
Expand Up @@ -523,6 +523,9 @@ def _detect_relationships(self):
def detect_table_from_dataframe(self, table_name, data):
"""Detect the metadata for a table from a dataframe.
This method automatically detects the ``sdtypes`` of the given ``pandas.DataFrame``
for the specified table. All column names are converted to strings.
Args:
table_name (str):
Name of the table to detect.
@@ -538,6 +541,9 @@ def detect_table_from_dataframe(self, table_name, data):
def detect_from_dataframes(self, data):
"""Detect the metadata for all tables in a dictionary of dataframes.
This method automatically detects the ``sdtypes`` for the given ``pandas.DataFrame``.
All data column names are converted to strings.
Args:
data (dict):
Dictionary of table names to dataframes.
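The docstrings' note that column names are converted to strings is what resolves the integer-column crashes listed in the release notes (e.g. issue #1933). The normalization amounts to a sketch like this (not the exact SDV internals):

```python
def stringify_columns(table):
    """Return a copy of a {column_name: values} mapping with every
    column name coerced to str, as the docstrings above describe."""
    return {str(name): values for name, values in table.items()}

# A table whose columns mix ints and strings, as in issue #1933.
table = {0: [1, 2], 'checkin_date': ['2024-01-01', '2024-01-02']}
normalized = stringify_columns(table)
```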
