bug/Wrong parsing of html, xml code blocks in markdown #3578

Open
cgjosephlee opened this issue Aug 29, 2024 · 1 comment
Labels
bug Something isn't working

cgjosephlee commented Aug 29, 2024

Describe the bug
HTML and XML code blocks in markdown are not parsed properly.

Results:

HTML Example
```html
Hello, World!
This is a simple HTML example.
```
XML Example
xml <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
```xml
```
```xml
```
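  • HTML tags are not preserved; only the text content survives.
  • XML code blocks are malformed: the first is flattened onto a single line, and blank lines inside the fence appear to erase the content.
  • A block containing the <?xml version='1.0' encoding='UTF-8'?> declaration crashes the parser: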
Traceback (most recent call last):
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/test.py", line 14, in <module>
    elems = partition_html(
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 706, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 662, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 103, in partition_html
    elements = list(
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/lang.py", line 475, in apply_lang_metadata
    elements = list(elements)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 222, in iter_elements
    yield from cls(opts)._iter_elements()
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 229, in _iter_elements
    for e in self._main.iter_elements():
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 361, in iter_elements
    yield from self._element_from_text_or_tail(block_item.tail or "", q)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 377, in _element_from_text_or_tail
    for node in self._iter_text_segments(text, q):
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 421, in _iter_text_segments
    while q and q[0].is_phrasing:
AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'
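
The AttributeError suggests the XML declaration reached the HTML parser unescaped and was reported as a processing-instruction node rather than as an element. A minimal sketch of that distinction follows. It uses the standard-library `html.parser` purely as a stand-in for lxml, and `NodeKindLogger` is a hypothetical helper, not part of unstructured:

```python
from html.parser import HTMLParser


class NodeKindLogger(HTMLParser):
    """Hypothetical helper: record the kind of each node the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Ordinary elements such as <note> and <to> arrive here.
        self.events.append(("element", tag))

    def handle_pi(self, data):
        # An unescaped <?xml ...?> arrives here, not in handle_starttag.
        self.events.append(("processing-instruction", data))


parser = NodeKindLogger()
parser.feed("<?xml version='1.0' encoding='UTF-8'?><note><to>Tove</to></note>")
parser.close()

for kind, value in parser.events:
    print(kind, value)
```

Code that walks the parsed tree and assumes every node is an element would hit this extra node kind, which is consistent with the missing `is_phrasing` attribute in the traceback.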

To Reproduce

## HTML Example

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample HTML</title>
</head>
<body>
    <h1>Hello, World!</h1>
    <p>This is a simple HTML example.</p>
</body>
</html>
```

## XML Example

```xml
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
```

```xml

<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

```

```xml
<?xml version='1.0' encoding='UTF-8'?>
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
```

Expected behavior
The content of code blocks should be preserved verbatim.

Screenshots

Environment Info
unstructured 0.15.7

Additional context
Since markdown is first converted to HTML, adding extensions=['fenced_code'] to the markdown parser call solves the issue. A better option would be to make the extensions list a configurable parameter. The relevant call currently reads:

html = markdown.markdown(text, extensions=["tables"])
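
The difference can be checked directly with the Python-Markdown package. The sketch below assumes `markdown` is installed; `markdown_to_html` is a hypothetical wrapper illustrating the "configurable parameter" idea, not an existing unstructured function:

```python
import markdown

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fence

SAMPLE = "\n".join([
    "## XML Example",
    "",
    FENCE + "xml",
    "<?xml version='1.0' encoding='UTF-8'?>",
    "<note>",
    "    <to>Tove</to>",
    "</note>",
    FENCE,
])


def markdown_to_html(text, extensions=("tables",)):
    """Hypothetical wrapper: expose the extension list as a parameter."""
    return markdown.markdown(text, extensions=list(extensions))


# Same call shape as the line quoted above: only the "tables" extension.
html_default = markdown_to_html(SAMPLE)

# With fenced_code, the block becomes <pre><code> and its markup is escaped,
# so the HTML parser never sees <?xml ...?> or <note> as real nodes.
html_fenced = markdown_to_html(SAMPLE, extensions=("tables", "fenced_code"))

print(html_default)
print("-----")
print(html_fenced)
```

Defaulting the hypothetical parameter to `("tables",)` keeps the current behaviour for existing callers while letting others opt in to `fenced_code`.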

cgjosephlee added the bug label Aug 29, 2024

MthwRobinson (Contributor) commented

Hi @cgjosephlee - Thanks for the report and the detailed reproduction steps. We'll take a look as soon as we're able. cc @scanny.
