Merge branch 'main' into issue_549_disconnected_schema
lajohn4747 committed Jun 5, 2024
2 parents d04dff2 + 29cf341 commit 3871c2a
Showing 34 changed files with 935 additions and 296 deletions.
34 changes: 34 additions & 0 deletions .github/workflows/install.yaml
@@ -0,0 +1,34 @@
name: Install Tests
on:
pull_request:
types: [opened, synchronize]
push:
branches:
- main
jobs:
install:
name: ${{ matrix.python_version }} install
strategy:
fail-fast: true
matrix:
python_version: ["3.8", "3.12"]
runs-on: ubuntu-latest
steps:
- name: Set up python ${{ matrix.python_version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python_version }}
- uses: actions/checkout@v4
- name: Build package
run: |
make package
- name: Install package
run: |
python -m pip install "unpacked_sdist/."
- name: Test by importing packages
run: |
python -c "import sdv"
python -c "import sdv;print(sdv.version.public)"
- name: Check package conflicts
run: |
python -m pip check
1 change: 1 addition & 0 deletions .github/workflows/integration.yml
@@ -9,6 +9,7 @@ jobs:
integration:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: true
matrix:
python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
os: [ubuntu-latest, windows-latest]
1 change: 1 addition & 0 deletions .github/workflows/minimum.yml
@@ -9,6 +9,7 @@ jobs:
minimum:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: true
matrix:
python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
os: [ubuntu-latest, windows-latest]
1 change: 1 addition & 0 deletions .github/workflows/unit.yml
@@ -9,6 +9,7 @@ jobs:
unit:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: true
matrix:
python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
os: [ubuntu-latest, windows-latest]
40 changes: 40 additions & 0 deletions HISTORY.md
@@ -1,5 +1,45 @@
# Release Notes

## 1.13.1 - 2024-05-16

This release fixes the `ModuleNotFoundError` error that was causing the 1.13.0 release to fail.

## 1.13.0 - 2024-05-15

This release adds a utility function called `get_random_subset` that helps users get a subset of their multi-table data so that modeling can be done more quickly. Given a dictionary of table names mapped to DataFrames, the metadata, a main table and a desired number of rows for the main table, it subsamples the data in a way that maintains referential integrity.
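A sketch of the referential-integrity idea behind `get_random_subset`, in plain Python. The function name, the two-table layout, and the `id` key below are illustrative, not SDV's actual API:

```python
import random

def subsample_with_integrity(parent_rows, child_rows, foreign_key, num_parent_rows, seed=0):
    """Randomly keep `num_parent_rows` parent rows, then keep only the
    child rows whose foreign key still points at a surviving parent."""
    rng = random.Random(seed)
    kept_parents = rng.sample(parent_rows, num_parent_rows)
    kept_ids = {row['id'] for row in kept_parents}
    kept_children = [row for row in child_rows if row[foreign_key] in kept_ids]
    return kept_parents, kept_children

# Hypothetical two-table dataset: guests and their bookings.
guests = [{'id': i} for i in range(10)]
bookings = [{'booking_id': b, 'guest_id': b % 10} for b in range(30)]
sampled_guests, sampled_bookings = subsample_with_integrity(
    guests, bookings, foreign_key='guest_id', num_parent_rows=3)
```

Every surviving booking still references a surviving guest, which is the property the release notes call referential integrity.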

This release also adds two new local file handlers: the `CSVHandler` and the `ExcelHandler`. These enable users to easily load synthetic data from, and save it to, these file types. The handlers return data and metadata in the multi-table format, so we also added the function `get_table_metadata` to get a `SingleTableMetadata` object from a `MultiTableMetadata` object.
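A minimal stdlib sketch of what a local CSV handler does: read every `*.csv` in a folder into a dictionary keyed by table name, the multi-table convention. `read_csv_folder` is an illustrative name, not the SDV API:

```python
import csv
from pathlib import Path

def read_csv_folder(folder):
    """Load each CSV file in `folder` as a list of row dicts,
    keyed by the file's stem (the table name)."""
    data = {}
    for path in Path(folder).glob('*.csv'):
        with open(path, newline='') as f:
            data[path.stem] = list(csv.DictReader(f))
    return data
```

The real `CSVHandler` additionally detects metadata; this sketch covers only the load-into-a-dict shape of the result.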

Finally, this release fixes some bugs that prevented synthesizers from working with data that had numerical column names.

### New Features

* Add `get_random_subset` poc utility function - Issue [#1877](https://github.com/sdv-dev/SDV/issues/1877) by @R-Palazzo
* Add usage logging - Issue [#1903](https://github.com/sdv-dev/SDV/issues/1903) by @pvk-developer
* Move function `drop_unknown_references` from `poc` to be directly under `utils` - Issue [#1947](https://github.com/sdv-dev/SDV/issues/1947) by @R-Palazzo
* Add CSVHandler - Issue [#1949](https://github.com/sdv-dev/SDV/issues/1949) by @pvk-developer
* Add ExcelHandler - Issue [#1950](https://github.com/sdv-dev/SDV/issues/1950) by @pvk-developer
* Add get_table_metadata function - Issue [#1951](https://github.com/sdv-dev/SDV/issues/1951) by @R-Palazzo
* Save usage log file as a csv - Issue [#1974](https://github.com/sdv-dev/SDV/issues/1974) by @frances-h
* Split out metadata creation from data import in the local files handlers - Issue [#1975](https://github.com/sdv-dev/SDV/issues/1975) by @pvk-developer
* Improve error message when trying to sample before fitting (single table) - Issue [#1978](https://github.com/sdv-dev/SDV/issues/1978) by @R-Palazzo

### Bugs Fixed

* Metadata detection crashes when the column names are integers (`AttributeError: 'int' object has no attribute 'lower'`) - Issue [#1933](https://github.com/sdv-dev/SDV/issues/1933) by @lajohn4747
* Synthesizers crash when column names are integers (`TypeError: unsupported operand`) - Issue [#1935](https://github.com/sdv-dev/SDV/issues/1935) by @lajohn4747
* Switch parameter order in drop_unknown_references - Issue [#1944](https://github.com/sdv-dev/SDV/issues/1944) by @R-Palazzo
* Unexpected NaN values in sequence_index when dataframe isn't reset - Issue [#1973](https://github.com/sdv-dev/SDV/issues/1973) by @fealho
* Fix pandas DtypeWarning in download_demo - Issue [#1980](https://github.com/sdv-dev/SDV/issues/1980) by @fealho

### Maintenance

* Only run unit and integration tests on oldest and latest python versions for macos - Issue [#1948](https://github.com/sdv-dev/SDV/issues/1948) by @frances-h

### Internal

* Update code to remove `FutureWarning` related to 'enforce_uniqueness' parameter - Issue [#1995](https://github.com/sdv-dev/SDV/issues/1995) by @pvk-developer

## 1.12.1 - 2024-04-19

This release makes a number of changes to how ID columns are generated. By default, ID columns with a regex will now have their values scrambled in the output. Numeric ID columns without a regex will be created randomly; non-numeric ones will receive a random suffix.
25 changes: 24 additions & 1 deletion Makefile
@@ -235,6 +235,10 @@ ifeq ($(CHANGELOG_LINES),0)
$(error Please insert the release notes in HISTORY.md before releasing)
endif

.PHONY: git-push
git-push: ## Simply push the repository to GitHub
git push

.PHONY: check-release
check-release: check-clean check-main check-history ## Check if the release can be made
@echo "A new release can be made"
@@ -261,5 +265,24 @@ release-major: check-release bumpversion-major release

.PHONY: check-deps
check-deps:
$(eval allow_list='cloudpickle=|graphviz=|numpy=|pandas=|tqdm=|copulas=|ctgan=|deepecho=|rdt=|sdmetrics=|platformdirs=')
$(eval allow_list='cloudpickle=|graphviz=|numpy=|pandas=|tqdm=|copulas=|ctgan=|deepecho=|rdt=|sdmetrics=|platformdirs=|pyyaml=')
pip freeze | grep -v "SDV.git" | grep -E $(allow_list) | sort > $(OUTPUT_FILEPATH)

.PHONY: upgradepip
upgradepip:
python -m pip install --upgrade pip

.PHONY: upgradebuild
upgradebuild:
python -m pip install --upgrade build

.PHONY: upgradesetuptools
upgradesetuptools:
python -m pip install --upgrade setuptools

.PHONY: package
package: upgradepip upgradebuild upgradesetuptools
python -m build ; \
$(eval VERSION=$(shell python -c 'import setuptools; setuptools.setup()' --version))
tar -zxvf "dist/sdv-${VERSION}.tar.gz"
mv "sdv-${VERSION}" unpacked_sdist
20 changes: 12 additions & 8 deletions README.md
@@ -94,12 +94,12 @@ column and the primary key (`guest_email`).
## Synthesizing Data
Next, we can create an **SDV synthesizer**, an object that you can use to create synthetic data.
It learns patterns from the real data and replicates them to generate synthetic data. Let's use
the `FAST_ML` preset synthesizer, which is optimized for performance.
the [GaussianCopulaSynthesizer](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/gaussiancopulasynthesizer).

```python
from sdv.lite import SingleTablePreset
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = SingleTablePreset(metadata, name='FAST_ML')
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)
```

@@ -131,11 +131,15 @@ quality_report = evaluate_quality(
```

```
Creating report: 100%|██████████| 4/4 [00:00<00:00, 19.30it/s]
Overall Quality Score: 89.12%
Properties:
Column Shapes: 90.27%
Column Pair Trends: 87.97%
Generating report ...
(1/2) Evaluating Column Shapes: |████████████████| 9/9 [00:00<00:00, 1133.09it/s]|
Column Shapes Score: 89.11%
(2/2) Evaluating Column Pair Trends: |██████████████████████████████████████████| 36/36 [00:00<00:00, 502.88it/s]|
Column Pair Trends Score: 88.3%
Overall Score (Average): 88.7%
```
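As a sanity check, the "Overall Score (Average)" line in the new output is just the mean of the two property scores printed above it:

```python
# The report's overall score is the average of its property scores.
column_shapes = 89.11
column_pair_trends = 88.3

overall = (column_shapes + column_pair_trends) / 2
print(round(overall, 1))  # 88.7, matching the report output
```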

This object computes an overall quality score on a scale of 0 to 100% (100 being the best) as well
8 changes: 4 additions & 4 deletions latest_requirements.txt
@@ -1,11 +1,11 @@
cloudpickle==3.0.0
copulas==0.11.0
ctgan==0.10.0
ctgan==0.10.1
deepecho==0.6.0
graphviz==0.20.3
numpy==1.26.4
pandas==2.2.2
platformdirs==4.2.1
rdt==1.12.0
sdmetrics==0.14.0
platformdirs==4.2.2
rdt==1.12.1
sdmetrics==0.14.1
tqdm==4.66.4
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -38,6 +38,7 @@ dependencies = [
'rdt>=1.12.0',
'sdmetrics>=0.14.0',
'platformdirs>=4.0',
'pyyaml>=6.0.1',
]

[project.urls]
@@ -157,7 +158,7 @@ namespaces = false
version = {attr = 'sdv.__version__'}

[tool.bumpversion]
current_version = "1.12.2.dev0"
current_version = "1.13.2.dev0"
parse = '(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<release>[a-z]+)(?P<candidate>\d+))?'
serialize = [
'{major}.{minor}.{patch}.{release}{candidate}',
2 changes: 1 addition & 1 deletion sdv/__init__.py
@@ -6,7 +6,7 @@

__author__ = 'DataCebo, Inc.'
__email__ = '[email protected]'
__version__ = '1.12.2.dev0'
__version__ = '1.13.2.dev0'


import sys
2 changes: 1 addition & 1 deletion sdv/datasets/demo.py
@@ -96,7 +96,7 @@ def _get_data(modality, output_folder_name, in_memory_directory):
for filename, file_ in in_memory_directory.items():
if filename.endswith('.csv'):
table_name = Path(filename).stem
data[table_name] = pd.read_csv(io.StringIO(file_.decode()))
data[table_name] = pd.read_csv(io.StringIO(file_.decode()), low_memory=False)

if modality != 'multi_table':
data = data.popitem()[1]
4 changes: 3 additions & 1 deletion sdv/logging/__init__.py
@@ -1,10 +1,12 @@
"""Module for configuring loggers within the SDV library."""

from sdv.logging.logger import get_sdv_logger
from sdv.logging.utils import disable_single_table_logger, get_sdv_logger_config
from sdv.logging.utils import (
disable_single_table_logger, get_sdv_logger_config, load_logfile_dataframe)

__all__ = (
'disable_single_table_logger',
'get_sdv_logger',
'get_sdv_logger_config',
'load_logfile_dataframe'
)
31 changes: 29 additions & 2 deletions sdv/logging/logger.py
@@ -1,11 +1,35 @@
"""SDV Logger."""

import csv
import logging
from functools import lru_cache
from io import StringIO

from sdv.logging.utils import get_sdv_logger_config


class CSVFormatter(logging.Formatter):
"""Logging formatter to convert to CSV."""

def __init__(self):
super().__init__()
self.output = StringIO()
headers = [
'LEVEL', 'EVENT', 'TIMESTAMP', 'SYNTHESIZER CLASS NAME', 'SYNTHESIZER ID',
'TOTAL NUMBER OF TABLES', 'TOTAL NUMBER OF ROWS', 'TOTAL NUMBER OF COLUMNS'
]
self.writer = csv.DictWriter(self.output, headers)

def format(self, record): # noqa: A003
"""Format the record and write to CSV."""
row = record.msg
row['LEVEL'] = record.levelname
self.writer.writerow(row)
data = self.output.getvalue()
self.output.truncate(0)
self.output.seek(0)
return data.strip()
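A self-contained sketch of how this formatter behaves once attached to a handler. The class is reproduced with a trimmed header list so the snippet runs on its own; the logger name and log payload are illustrative:

```python
import csv
import logging
from io import StringIO

class CSVFormatter(logging.Formatter):
    """Same idea as the class above: render each log record's
    dict payload as a single CSV row."""
    def __init__(self):
        super().__init__()
        self.output = StringIO()
        headers = ['LEVEL', 'EVENT', 'TIMESTAMP']  # trimmed for the sketch
        self.writer = csv.DictWriter(self.output, headers)

    def format(self, record):
        row = dict(record.msg)          # the record's msg is a dict
        row['LEVEL'] = record.levelname
        self.writer.writerow(row)
        data = self.output.getvalue()
        self.output.truncate(0)         # reset the buffer for the next record
        self.output.seek(0)
        return data.strip()

stream = StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(CSVFormatter())
logger = logging.getLogger('csv_demo')
logger.propagate = False
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info({'EVENT': 'Fit', 'TIMESTAMP': '2024-06-05 00:00:00'})
line = stream.getvalue().strip()  # 'INFO,Fit,2024-06-05 00:00:00'
```

Note the buffer reset after each record: the formatter reuses one `StringIO` so each `format` call returns exactly one CSV row.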


@lru_cache()
def get_sdv_logger(logger_name):
"""Get a logger instance with the specified name and configuration.
@@ -38,7 +62,10 @@ def get_sdv_logger(logger_name):
formatter = None
config = logger_conf.get('loggers').get(logger_name)
log_level = getattr(logging, config.get('level', 'INFO'))
if config.get('format'):
if config.get('formatter'):
if config.get('formatter') == 'sdv.logging.logger.CSVFormatter':
formatter = CSVFormatter()
elif config.get('format'):
formatter = logging.Formatter(config.get('format'))

logger.setLevel(log_level)
12 changes: 8 additions & 4 deletions sdv/logging/sdv_logger_config.yml
@@ -4,24 +4,28 @@ loggers:
SingleTableSynthesizer:
level: INFO
propagate: false
formatter: sdv.logging.logger.CSVFormatter
handlers:
class: logging.FileHandler
filename: sdv_logs.log
filename: sdv_logs.csv
MultiTableSynthesizer:
level: INFO
propagate: false
formatter: sdv.logging.logger.CSVFormatter
handlers:
class: logging.FileHandler
filename: sdv_logs.log
filename: sdv_logs.csv
MultiTableMetadata:
level: INFO
propagate: false
formatter: sdv.logging.logger.CSVFormatter
handlers:
class: logging.FileHandler
filename: sdv_logs.log
filename: sdv_logs.csv
SingleTableMetadata:
level: INFO
propagate: false
formatter: sdv.logging.logger.CSVFormatter
handlers:
class: logging.FileHandler
filename: sdv_logs.log
filename: sdv_logs.csv
17 changes: 16 additions & 1 deletion sdv/logging/utils.py
@@ -5,6 +5,7 @@
import shutil
from pathlib import Path

import pandas as pd
import platformdirs
import yaml

@@ -25,7 +26,7 @@ def get_sdv_logger_config():

for logger in logger_conf.get('loggers', {}).values():
handler = logger.get('handlers', {})
if handler.get('filename') == 'sdv_logs.log':
if handler.get('filename') == 'sdv_logs.csv':
handler['filename'] = store_path / handler['filename']

return logger_conf
@@ -49,3 +50,17 @@ def disable_single_table_logger():
finally:
for handler in handlers:
single_table_logger.addHandler(handler)
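The try/finally shape visible above — detach the handlers, then always restore them — is the standard way to temporarily silence a logger. A generic sketch of that pattern (the `silenced` helper is illustrative, not part of SDV):

```python
import logging
from contextlib import contextmanager

@contextmanager
def silenced(logger):
    """Temporarily detach a logger's handlers, restoring them on exit
    even if the body raises (the same try/finally shape as above)."""
    handlers = list(logger.handlers)
    for handler in handlers:
        logger.removeHandler(handler)
    try:
        yield
    finally:
        for handler in handlers:
            logger.addHandler(handler)

demo = logging.getLogger('silence_demo')
demo.addHandler(logging.NullHandler())
with silenced(demo):
    inside = len(demo.handlers)   # 0 while silenced
after = len(demo.handlers)        # handler restored
```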


def load_logfile_dataframe(logfile):
"""Load the SDV logfile as a pandas DataFrame with correct column headers.
Args:
logfile (str):
Path to the SDV log CSV file.
"""
column_names = [
'LEVEL', 'EVENT', 'TIMESTAMP', 'SYNTHESIZER CLASS NAME', 'SYNTHESIZER ID',
'TOTAL NUMBER OF TABLES', 'TOTAL NUMBER OF ROWS', 'TOTAL NUMBER OF COLUMNS'
]
return pd.read_csv(logfile, names=column_names)
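Since the log file has no header row, the explicit `names=` list is what gives the columns their labels. A stdlib illustration of the same pairing (the log line below is made up):

```python
import csv
from io import StringIO

COLUMN_NAMES = [
    'LEVEL', 'EVENT', 'TIMESTAMP', 'SYNTHESIZER CLASS NAME', 'SYNTHESIZER ID',
    'TOTAL NUMBER OF TABLES', 'TOTAL NUMBER OF ROWS', 'TOTAL NUMBER OF COLUMNS'
]

# One illustrative headerless log row, as the CSVFormatter would emit it.
logfile = StringIO(
    'INFO,Fit,2024-06-05 00:00:00,GaussianCopulaSynthesizer,sy-1,1,500,9\n')
rows = [dict(zip(COLUMN_NAMES, row)) for row in csv.reader(logfile)]
```

`pd.read_csv(logfile, names=column_names)` does the same pairing, but returns a DataFrame instead of a list of dicts.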
6 changes: 6 additions & 0 deletions sdv/metadata/multi_table.py
Original file line number Diff line number Diff line change
Expand Up @@ -523,6 +523,9 @@ def _detect_relationships(self):
def detect_table_from_dataframe(self, table_name, data):
"""Detect the metadata for a table from a dataframe.
This method automatically detects the ``sdtypes`` of the given ``pandas.DataFrame``
for the specified table. All column names are converted to strings.
Args:
table_name (str):
Name of the table to detect.
@@ -538,6 +541,9 @@ def detect_table_from_dataframe(self, table_name, data):
def detect_from_dataframes(self, data):
"""Detect the metadata for all tables in a dictionary of dataframes.
This method automatically detects the ``sdtypes`` for the given ``pandas.DataFrame``.
All data column names are converted to strings.
Args:
data (dict):
Dictionary of table names to dataframes.
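The docstrings' note that column names are converted to strings is what resolves the integer-column crashes listed in the release notes (e.g. issue #1933). The normalization amounts to a sketch like this (not the exact SDV internals):

```python
def stringify_columns(table):
    """Return a copy of a {column_name: values} mapping with every
    column name coerced to str, as the docstrings above describe."""
    return {str(name): values for name, values in table.items()}

# A table whose columns mix ints and strings, as in issue #1933.
table = {0: [1, 2], 'checkin_date': ['2024-01-01', '2024-01-02']}
normalized = stringify_columns(table)
```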
