Describe the bug
HTML and XML code blocks in markdown are not parsed properly.
Results:
HTML Example
```html
Hello, World!
This is a simple HTML example.
```
XML Example
xml <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
```xml
```
```xml
```
HTML tags are not preserved.
The XML code is malformed. Blank lines inside a code block may cause the fence context to be lost.
A <?xml version='1.0' encoding='UTF-8'?> line breaks the parser:
Traceback (most recent call last):
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/test.py", line 14, in <module>
elems = partition_html(
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 605, in wrapper
elements = func(*args, **kwargs)
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 706, in wrapper
elements = func(*args, **kwargs)
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 662, in wrapper
elements = func(*args, **kwargs)
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 103, in partition_html
elements = list(
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/lang.py", line 475, in apply_lang_metadata
elements = list(elements)
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 222, in iter_elements
yield from cls(opts)._iter_elements()
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 229, in _iter_elements
for e in self._main.iter_elements():
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 361, in iter_elements
yield from self._element_from_text_or_tail(block_item.tail or "", q)
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 377, in _element_from_text_or_tail
for node in self._iter_text_segments(text, q):
File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 421, in _iter_text_segments
while q and q[0].is_phrasing:
AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'
To Reproduce
## HTML Example
```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Sample HTML</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>This is a simple HTML example.</p>
</body>
</html>
```
## XML Example
```xml
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
```
```xml
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
```
```xml
<?xml version='1.0' encoding='UTF-8'?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
```
Expected behavior
The content in code blocks should be preserved as it is.
Screenshots
Environment Info
0.15.7
Additional context
Since the Markdown is first converted to HTML, adding extensions=['fenced_code'] to the Markdown parser call solves the issue. A better approach would be to make the extensions list a configurable parameter.
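A minimal sketch of the suggestion above, assuming the conversion goes through the Python-Markdown package (`markdown.markdown`). The helper name `markdown_to_html` and its `extensions` parameter are hypothetical and only illustrate what a configurable extensions list could look like; they are not part of the library under discussion.

```python
import markdown  # Python-Markdown package (pip install markdown)

# Build the fence marker programmatically so the sample nests cleanly here.
FENCE = "`" * 3

SAMPLE = "\n".join([
    "## HTML Example",
    "",
    FENCE + "html",
    "<h1>Hello, World!</h1>",
    "<p>This is a simple HTML example.</p>",
    FENCE,
])


def markdown_to_html(text, extensions=None):
    """Hypothetical helper: convert Markdown to HTML with a caller-supplied extension list."""
    return markdown.markdown(text, extensions=list(extensions or []))


# Without extensions, fenced blocks are not recognized as code blocks.
html_default = markdown_to_html(SAMPLE)

# With 'fenced_code', the block is emitted as escaped text inside <pre><code>.
html_fenced = markdown_to_html(SAMPLE, extensions=["fenced_code"])

print(html_default)
print("-" * 40)
print(html_fenced)
```

With `fenced_code` enabled, the tags inside the block are escaped (for example `&lt;h1&gt;`), so a downstream HTML partitioner sees them as text rather than as markup to interpret.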
Relevant code: unstructured/unstructured/partition/md.py, line 109 in f440eb4