Releases · Unstructured-IO/unstructured

13 Sep 14:39

MthwRobinson

0.15.12

8b7e5bb

0.15.12 Latest

Enhancements

Improve pdfminer element processing. Implemented splitting of pdfminer elements (groups of text chunks) into smaller bounding boxes (text lines). This prevents loss of information from the object detection model and facilitates more effective removal of duplicated pdfminer text.

10 Sep 12:55

MthwRobinson

0.15.10

71208ca

0.15.10

Enhancements

Enhance pdfminer element cleanup. Expand removal of pdfminer elements to include those inside all non-pdfminer elements, not just tables.
Modified analysis drawing tools to dump to files and draw from dumps. If the analysis parameter of the partition_pdf function is set to True, the layouts for Object Detection, Pdfminer Extraction, and OCR, as well as the final layout, are dumped as JSON files. The drawers now accept dict (dump) objects instead of instances of internal classes.
Vectorize pdfminer elements deduplication computation. Use numpy operations to compute IOU and sub-region membership instead of a simple loop. This speeds up deduplication on pages with many elements.
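
To illustrate what vectorizing the IOU computation can look like, here is a minimal sketch that assumes boxes are given as (x1, y1, x2, y2) rows. The function name pairwise_iou is hypothetical and this is not the library's actual code; it only shows the general numpy broadcasting technique that replaces a nested loop.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b):
    """Return an (N, M) matrix of IOU values for boxes given as (x1, y1, x2, y2)."""
    boxes_a = np.asarray(boxes_a, dtype=float)
    boxes_b = np.asarray(boxes_b, dtype=float)
    # Broadcast (N, 1, 2) against (1, M, 2) to get every pairwise intersection corner.
    top_left = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])
    bottom_right = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    # Negative widths or heights mean no overlap, so clip them to zero.
    wh = np.clip(bottom_right - top_left, 0, None)
    intersection = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    union = area_a[:, None] + area_b[None, :] - intersection
    # Leave IOU at zero wherever the union is empty (degenerate boxes).
    return np.divide(intersection, union, out=np.zeros_like(intersection), where=union > 0)
```

All N x M pairs are computed in a handful of array operations, which is where the speedup on element-dense pages would come from.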

Features

Fixes

30 Aug 19:13

MthwRobinson

0.15.9

6ba8135

0.15.9

Enhancements

Features

Add support for encoding parameter in partition_csv
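
As a rough illustration of why an explicit encoding matters for CSV input, the sketch below uses only the standard library. The helper read_csv_rows is hypothetical and does not reflect how partition_csv is implemented; it only shows that bytes written in one encoding need to be decoded with the same one.

```python
import csv
import io


def read_csv_rows(data: bytes, encoding: str = "utf-8"):
    """Decode raw CSV bytes with an explicit encoding, then parse the rows."""
    text = data.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))


# A file saved as Latin-1 cannot be decoded as UTF-8, so the caller
# has to say which encoding to use.
raw = "name,city\nZoë,Köln\n".encode("latin-1")
rows = read_csv_rows(raw, encoding="latin-1")
```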

27 Aug 15:55

MthwRobinson

0.15.8

4194a07

0.15.8

Enhancements

Bump unstructured.paddleocr to 2.8.1.0.

Features

Add MixedbreadAI embedder. Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.

Fixes

Replace pillow-heif with pi-heif. Replaces pillow-heif with pi-heif due to more permissive licensing on the wheel for pi-heif.
Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate, which produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count without losing any text.
Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
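
The extension fallback for OLE files can be sketched roughly as follows. The function detect_ole_file_type, its sniffed argument, and the extension map are hypothetical names for illustration, not the library's API. The only fixed fact used is the 8-byte signature that opens every OLE (Compound File Binary) container.

```python
from pathlib import Path
from typing import Optional

# Signature shared by all OLE / Compound File Binary containers.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension map, for illustration only.
OLE_EXTENSIONS = {".doc": "DOC", ".xls": "XLS", ".ppt": "PPT", ".msg": "MSG"}


def detect_ole_file_type(filename: str, head: bytes, sniffed: Optional[str] = None) -> Optional[str]:
    """Return a file-type label for an OLE container, or None if undetermined."""
    if not head.startswith(OLE_MAGIC):
        return None  # not an OLE container at all
    if sniffed is not None:
        return sniffed  # content-based detection succeeded
    # Content sniffing was inconclusive: trust the filename extension
    # rather than guessing a default such as MSG.
    return OLE_EXTENSIONS.get(Path(filename).suffix.lower())
```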

20 Aug 19:53

christinestraub

0.15.7

01dbc7b

0.15.7

Enhancements

Features

Fixes

Fix NLTK data download path to prevent nested directories. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
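
One way to avoid the nested directory is to check whether the target path already ends in "nltk_data" before appending it. The helper below is a hypothetical sketch of that idea, not the code used in the fix.

```python
from pathlib import Path


def resolve_nltk_data_dir(base_dir: str) -> Path:
    """Return the nltk_data directory for base_dir without nesting it twice."""
    path = Path(base_dir)
    # If the caller already points at an nltk_data directory, use it as-is.
    if path.name == "nltk_data":
        return path
    return path / "nltk_data"
```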

20 Aug 12:47

MthwRobinson

0.15.6

1f8030d

0.15.6

Enhancements

Features

Fixes

Bump to NLTK 3.9.x. Bumps to the latest nltk version to resolve CVE.
Update CI for ingest-test-fixture-update-pr to resolve NLTK model download errors.
Synchronized text and html on TableChunk splits. When a Table element is divided during chunking to fit the chunking window, TableChunk.text corresponds exactly with the table text in TableChunk.metadata.text_as_html, .text_as_html is always parseable HTML, and the table is split on even row boundaries whenever possible.
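
The idea of keeping text and HTML in sync can be sketched by splitting the table at row boundaries first and then rendering both forms from the same rows. The function names below are hypothetical and the size accounting is simplified; this is not the chunker's actual implementation.

```python
from html import escape


def _rows_to_text(rows):
    # Cells separated by spaces, rows separated by newlines.
    return "\n".join(" ".join(row) for row in rows)


def _rows_to_html(rows):
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


def split_table_rows(rows, max_chars):
    """Split rows into (text, html) chunks, breaking only between rows."""
    chunks, current, size = [], [], 0
    for row in rows:
        row_len = len(" ".join(row))
        # Start a new chunk when adding this row would exceed the window.
        if current and size + 1 + row_len > max_chars:
            chunks.append(current)
            current, size = [], 0
        size += row_len + (1 if current else 0)  # +1 for the newline separator
        current.append(row)
    if current:
        chunks.append(current)
    # Both representations are rendered from the same rows, so they cannot drift apart.
    return [(_rows_to_text(c), _rows_to_html(c)) for c in chunks]
```

In this sketch a single row longer than the window is kept whole rather than split mid-row; how the library handles that case is not covered here.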

16 Aug 14:35

MthwRobinson

0.15.5

fc26426

0.15.5

Enhancements

Features

Fixes

Revert to using unstructured.pytesseract fork. Due to the unavailability of some recent release versions of pytesseract on PyPI, the project now uses the unstructured.pytesseract fork to ensure stability and continued support.
Bump libreoffice version in image. Bumps the libreoffice version to 25.2.5.2 to address CVEs.
Downgrade NLTK dependency version for compatibility. Due to the unavailability of nltk==3.8.2 on PyPI, the NLTK dependency has been downgraded to <3.8.2. This change ensures continued functionality and compatibility.

14 Aug 21:18

christinestraub

0.15.4

9b778e2

0.15.4

Enhancements

Features

Fixes

Resolve an installation error with pytesseract>=0.3.12 that occurred during pip install unstructured[pdf]==0.15.3.

14 Aug 17:23

christinestraub

0.15.3

d6a84bd

0.15.3

Enhancements

Features

Fixes

Remove the custom index URL from extra-paddleocr.in to resolve the error in the setup.py configuration.

13 Aug 13:40

MthwRobinson

0.15.2

7437f0a

0.15.2

Enhancements

Improve directory handling when extracting image blocks. The figures directory is no longer created when the extract_image_block_to_payload parameter is set to True.

Features

Added per-class Object Detection metrics in the evaluation. The metrics include average precision, precision, recall, and f1-score for each class in the dataset.
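
For reference, per-class precision, recall, and f1-score follow directly from true-positive, false-positive, and false-negative counts. The sketch below assumes those counts are already available per class; the function name and input shape are hypothetical. Average precision is omitted because it additionally requires confidence-ranked predictions.

```python
def per_class_metrics(counts):
    """Compute precision, recall and F1 per class from (tp, fp, fn) counts."""
    results = {}
    for label, (tp, fp, fn) in counts.items():
        # Each ratio falls back to 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results[label] = {"precision": precision, "recall": recall, "f1": f1}
    return results
```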

Fixes

Updates NLTK data file for compatibility with nltk>=3.8.2. The NLTK data file now contains punkt_tab, making it possible to upgrade to nltk>=3.8.2. nltk==3.8.2 patches CVE-2024-39705.
Renames Astra to Astra DB. Conforms with DataStax internal naming conventions.
Accommodate single-column CSV files. Resolves a limitation of partition_csv() where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
Accommodate image/jpg in PPTX as alias for image/jpeg. Resolves problem partitioning PPTX files having an invalid image/jpg (should be image/jpeg) MIME-type in the [Content_Types].xml member of the PPTX Zip archive.
Fixes an issue in Object Detection metrics. The issue was in preprocessing/validating the ground truth and predicted data for object detection metrics.
Removes dependency on unstructured.pytesseract. Unstructured forked pytesseract while waiting for code to be upstreamed. Now that the new version has been released, this fork can be removed.
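
The single-column CSV case above can be reproduced with the standard library: csv.Sniffer raises csv.Error when none of the candidate delimiters appears in the sample. The sketch below shows one possible way to tolerate that; detect_delimiter and its candidate set are hypothetical, and this is not necessarily how partition_csv() handles it.

```python
import csv
from typing import Optional


def detect_delimiter(sample: str, candidates: str = ",;|\t") -> Optional[str]:
    """Return the detected delimiter, or None when no candidate can be determined."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # No candidate delimiter was found; treat the data as single-column.
        return None
```

A caller can then fall back to a default, for example `detect_delimiter(sample) or ","`, since the choice of delimiter is irrelevant when every row has exactly one field.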
