File Converter Module¶
pipelines.pipelines.nodes.file_converter.docx ¶
DocxToTextConverter ¶
Source code in pipelines/pipelines/nodes/file_converter/docx.py
__init__ ¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| remove_numeric_tables | bool | Uses heuristics to remove numeric rows from tables. Tabular structures can be noise for the reader model if it cannot parse tables when looking for answers. However, tables may also contain long strings that are possible answer candidates, so rows containing strings are retained. | False |
| valid_languages | Optional[List[str]] | Validates the language of the extracted text against a list of languages given in ISO 639-1 format (https://en.wikipedia.org/wiki/ISO_639-1). This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has probably produced garbled text. | None |
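The `remove_numeric_tables` heuristic is only described in prose here. The sketch below shows one plausible form of such a filter, using the standard library only. The function name `drop_numeric_rows`, the 0.4 digit-ratio threshold, and the rule that lines ending in a period count as sentences are all assumptions made for illustration, not the module's confirmed implementation.

```python
def drop_numeric_rows(text: str, max_digit_ratio: float = 0.4) -> str:
    """Drop lines in which a large share of tokens contain a digit (hypothetical helper)."""
    kept = []
    for line in text.splitlines():
        words = line.split()
        # Tokens that contain at least one digit, e.g. "2019" or "10.5"
        numeric = [w for w in words if any(ch.isdigit() for ch in w)]
        # Assumption: a line ending with a period is a sentence and is kept
        is_numeric_row = (
            bool(words)
            and len(numeric) / len(words) > max_digit_ratio
            and not line.strip().endswith(".")
        )
        if not is_numeric_row:
            kept.append(line)
    return "\n".join(kept)


sample = "Revenue by quarter\n2019 10.5 11.2 12.8\nGrowth was driven by exports in 2019."
cleaned = drop_numeric_rows(sample)
print(cleaned)
```

Rows made up mostly of numbers are dropped, while rows that mix in enough plain words survive, which matches the intent stated in the parameter description.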
Source code in pipelines/pipelines/nodes/file_converter/docx.py
convert ¶
convert(file_path: Path, meta: Optional[Dict[str, Any]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = None) -> List[Dict[str, Any]]
Extract text from a .docx file. Note: because the docx format does not contain "page" information, this method extracts and returns a list of paragraphs. The method's name is kept for consistency with the other converters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the .docx file you want to convert. | required |
| meta | Optional[Dict[str, Any]] | Dictionary of metadata key-value pairs to attach to the returned document. | None |
| remove_numeric_tables | Optional[bool] | Uses heuristics to remove numeric rows from tables. Tabular structures can be noise for the reader model if it cannot parse tables when looking for answers. However, tables may also contain long strings that are possible answer candidates, so rows containing strings are retained. | None |
| valid_languages | Optional[List[str]] | Validates the language of the extracted text against a list of languages given in ISO 639-1 format (https://en.wikipedia.org/wiki/ISO_639-1). This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has probably produced garbled text. | None |
| encoding | Optional[str] | Not applicable. | None |
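To illustrate why a .docx file yields paragraphs rather than pages, the sketch below reads paragraph text directly from the `word/document.xml` part of the archive with the standard library. This is not how the converter itself is implemented (its helper signatures refer to `Document` and `Paragraph` objects, which suggests a dedicated docx library); the names `docx_paragraphs` and `build_demo_docx` are hypothetical, and the demo archive contains only the one part needed for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source) -> list:
    """Return the non-empty paragraph texts of a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> runs
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def build_demo_docx(texts) -> io.BytesIO:
    """Build a minimal in-memory archive holding only word/document.xml (demo only)."""
    body = "".join(f"<w:p><w:r><w:t>{t}</w:t></w:r></w:p>" for t in texts)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


demo = build_demo_docx(["First paragraph.", "Second paragraph."])
paragraphs = docx_paragraphs(demo)
print(paragraphs)
```

The document body is a sequence of paragraph elements with no page markers, so a paragraph is the natural unit to return.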
Source code in pipelines/pipelines/nodes/file_converter/docx.py
get_image_list ¶
Extract images from paragraph and document object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document | Document | file objects | required |
| paragraph | Paragraph | image paragraph | required |
Source code in pipelines/pipelines/nodes/file_converter/docx.py
save_images ¶
Save the parsed image into desc_path
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| image_list | | image files from the docx file | required |
Source code in pipelines/pipelines/nodes/file_converter/docx.py
DocxTotxtConverter ¶
Source code in pipelines/pipelines/nodes/file_converter/docx.py
convert ¶
Extract text from a .docx file.
Source code in pipelines/pipelines/nodes/file_converter/docx.py
pipelines.pipelines.nodes.file_converter.image ¶
ImageToTextConverter ¶
Source code in pipelines/pipelines/nodes/file_converter/image.py
__init__ ¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| remove_numeric_tables | bool | Uses heuristics to remove numeric rows from tables. Tabular structures can be noise for the reader model if it cannot parse tables when looking for answers. However, tables may also contain long strings that are possible answer candidates, so rows containing strings are retained. | False |
| valid_languages | Optional[List[str]] | Validates the language of the extracted text against a list of languages supported by Tesseract (https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html). This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has probably produced garbled text. To list the available language packs, run `print(pytesseract.get_languages(config=''))`. | ['eng'] |
Source code in pipelines/pipelines/nodes/file_converter/image.py
convert ¶
convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = 'utf-8', **kwargs: Any) -> List[Dict[str, Any]]
Extract text from image file using the pytesseract library (https://github.com/madmaze/pytesseract)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the image file. | required |
| meta | Optional[Dict[str, str]] | Optional dictionary of metadata to attach to all resulting documents. Can contain any custom keys and values. | None |
| remove_numeric_tables | Optional[bool] | Uses heuristics to remove numeric rows from tables. Tabular structures can be noise for the reader model if it cannot parse tables when looking for answers. However, tables may also contain long strings that are possible answer candidates, so rows containing strings are retained. | None |
| valid_languages | Optional[List[str]] | Validates the language of the extracted text against a list of languages supported by Tesseract (https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html). This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has probably produced garbled text. | None |
Source code in pipelines/pipelines/nodes/file_converter/image.py
pipelines.pipelines.nodes.file_converter.markdown ¶
MarkdownConverter ¶
Source code in pipelines/pipelines/nodes/file_converter/markdown.py
convert ¶
convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = 'utf-8') -> List[Dict[str, Any]]
Reads text from a Markdown file and executes optional preprocessing steps.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path of the file to convert. | required |
| meta | Optional[Dict[str, str]] | Dictionary of metadata key-value pairs to attach to the returned document. | None |
| encoding | Optional[str] | Select the file encoding (default is `utf-8`). | 'utf-8' |
| remove_numeric_tables | Optional[bool] | Not applicable. | None |
| valid_languages | Optional[List[str]] | Not applicable. | None |
Returns:
| Type | Description |
|---|---|
| List[Dict[str, Any]] | List of dicts of the format {"text": "The text from file", "meta": meta} |
Source code in pipelines/pipelines/nodes/file_converter/markdown.py
markdown_to_text staticmethod ¶
Converts a markdown string to plaintext
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| markdown_string | str | String in markdown format | required |
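As a rough illustration of what converting a markdown string to plaintext involves, the sketch below strips common markup with regular expressions from the standard library. It is a simplified stand-in, not the implementation behind `markdown_to_text`: the function name `markdown_to_plaintext` is hypothetical, and the sketch handles only fenced code, inline code, links, images, headings, bullets, and asterisk emphasis.

```python
import re


def markdown_to_plaintext(markdown_string: str) -> str:
    """Strip a small subset of markdown syntax (hypothetical, regex-based sketch)."""
    # Drop fenced code blocks entirely
    text = re.sub(r"```.*?```", " ", markdown_string, flags=re.DOTALL)
    # Unwrap inline code spans
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Images become their alt text, links become their link text
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Remove heading markers and bullet markers at the start of a line
    text = re.sub(r"^\s{0,3}#{1,6}\s+", "", text, flags=re.MULTILINE)
    text = re.sub(r"^\s*[-*+]\s+", "", text, flags=re.MULTILINE)
    # Unwrap asterisk emphasis; underscore emphasis is left alone so that
    # identifiers such as snake_case names are not mangled
    text = re.sub(r"(\*\*|\*)(.+?)\1", r"\2", text)
    return text.strip()


plain = markdown_to_plaintext("# Title\n\nSome **bold** text with a [link](https://example.com).")
print(plain)
```

A full implementation would more likely render the markdown to HTML and then extract the text nodes, which copes with nested and unusual syntax far better than regular expressions.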
Source code in pipelines/pipelines/nodes/file_converter/markdown.py
MarkdownRawTextConverter ¶
Source code in pipelines/pipelines/nodes/file_converter/markdown.py
convert ¶
convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = 'utf-8', **kwargs: Any) -> List[Dict[str, Any]]
Reads text from a Markdown file and executes optional preprocessing steps.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path of the file to convert. | required |
| meta | Optional[Dict[str, str]] | Dictionary of metadata key-value pairs to attach to the returned document. | None |
| encoding | Optional[str] | Select the file encoding (default is `utf-8`). | 'utf-8' |
| remove_numeric_tables | Optional[bool] | Not applicable. | None |
| valid_languages | Optional[List[str]] | Not applicable. | None |
Returns:
| Type | Description |
|---|---|
| List[Dict[str, Any]] | List of dicts of the format {"text": "The text from file", "meta": meta} |
Source code in pipelines/pipelines/nodes/file_converter/markdown.py
pipelines.pipelines.nodes.file_converter.pdf ¶
PDFToTextConverter ¶
Source code in pipelines/pipelines/nodes/file_converter/pdf.py
__init__ ¶
__init__(remove_numeric_tables: bool = False, language: str = 'en', valid_languages: Optional[List[str]] = None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| remove_numeric_tables | bool | Uses heuristics to remove numeric rows from tables. Tabular structures can be noise for the reader model if it cannot parse tables when looking for answers. However, tables may also contain long strings that are possible answer candidates, so rows containing strings are retained. | False |
| valid_languages | Optional[List[str]] | Validates the language of the extracted text against a list of languages given in ISO 639-1 format (https://en.wikipedia.org/wiki/ISO_639-1). This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has probably produced garbled text. | None |
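The `valid_languages` check can be pictured as "detect the language of the extracted text, then test membership in the allowed list". The sketch below shows that control flow with a deliberately tiny stopword-based detector so that it runs on the standard library alone. The names `toy_detect` and `is_valid_language`, the stopword lists, and the pluggable `detect` argument are all assumptions for illustration; a real converter would rely on a proper language-detection library.

```python
from typing import Callable, List, Optional

# Tiny illustrative stopword lists; a real detector would be far more robust
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
}


def toy_detect(text: str) -> Optional[str]:
    """Guess an ISO 639-1 code by counting stopword hits; None if nothing matches."""
    words = text.lower().split()
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def is_valid_language(
    text: str,
    valid_languages: Optional[List[str]],
    detect: Callable[[str], Optional[str]] = toy_detect,
) -> bool:
    """Return True when no restriction is set or the detected language is allowed."""
    if not valid_languages:
        return True
    return detect(text) in valid_languages


print(is_valid_language("the cat is in the house", ["en"]))
print(is_valid_language("Ã¤Ã¶ â€œ Ã¼ÃŸ", ["en"]))
```

Text garbled by a wrong encoding rarely resembles any expected language, which is why a failed language check is a useful hint that decoding went wrong.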
Source code in pipelines/pipelines/nodes/file_converter/pdf.py
convert ¶
convert(file_path: Path, process_num: int = 20, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, language: Optional[str] = 'en', **kwargs: Any) -> List[Dict[str, Any]]
Extract text from a .pdf file using the pypdf library (https://pybrary.net/pyPdf/)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the .pdf file you want to convert. | required |
| process_num | int | Number of processes. | 20 |
| meta | Optional[Dict[str, str]] | Optional dictionary of metadata to attach to all resulting documents. Can contain any custom keys and values. | None |
| remove_numeric_tables | Optional[bool] | Uses heuristics to remove numeric rows from tables. Tabular structures can be noise for the reader model if it cannot parse tables when looking for answers. However, tables may also contain long strings that are possible answer candidates, so rows containing strings are retained. | None |
| valid_languages | Optional[List[str]] | Validates the language of the extracted text against a list of languages given in ISO 639-1 format (https://en.wikipedia.org/wiki/ISO_639-1). This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has probably produced garbled text. | None |
Source code in pipelines/pipelines/nodes/file_converter/pdf.py
PDFToTextOCRConverter ¶
Source code in pipelines/pipelines/nodes/file_converter/pdf.py
__init__ ¶
Extract text from image file using the pytesseract library (https://github.com/madmaze/pytesseract)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| remove_numeric_tables | bool | Uses heuristics to remove numeric rows from tables. Tabular structures can be noise for the reader model if it cannot parse tables when looking for answers. However, tables may also contain long strings that are possible answer candidates, so rows containing strings are retained. | False |
| valid_languages | Optional[List[str]] | Validates the language of the extracted text against a list of languages supported by Tesseract (https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html). This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has probably produced garbled text. | ['eng'] |
Source code in pipelines/pipelines/nodes/file_converter/pdf.py
convert ¶
convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = 'utf-8') -> List[Dict[str, Any]]
Convert a file to a dictionary containing the text and any associated metadata.
File converters may extract file metadata such as name or size. In addition, user-supplied metadata such as author, URL, or external IDs can be passed as a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path of the file to convert. | required |
| meta | Optional[Dict[str, str]] | Dictionary of metadata key-value pairs to attach to the returned document. | None |
| remove_numeric_tables | Optional[bool] | Uses heuristics to remove numeric rows from tables. Tabular structures can be noise for the reader model if it cannot parse tables when looking for answers. However, tables may also contain long strings that are possible answer candidates, so rows containing strings are retained. | None |
| valid_languages | Optional[List[str]] | Validates the language of the extracted text against a list of languages given in ISO 639-1 format (https://en.wikipedia.org/wiki/ISO_639-1). This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has probably produced garbled text. | None |
| encoding | Optional[str] | Select the file encoding (default is `utf-8`). | 'utf-8' |
Source code in pipelines/pipelines/nodes/file_converter/pdf.py
pipelines.pipelines.nodes.file_converter.txt ¶
TextConverter ¶
Source code in pipelines/pipelines/nodes/file_converter/txt.py
convert ¶
convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = 'utf-8', **kwargs: Any) -> List[Dict[str, Any]]
Reads text from a txt file and executes optional preprocessing steps.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path of the file to convert. | required |
| meta | Optional[Dict[str, str]] | Dictionary of metadata key-value pairs to attach to the returned document. | None |
| remove_numeric_tables | Optional[bool] | Uses heuristics to remove numeric rows from tables. Tabular structures can be noise for the reader model if it cannot parse tables when looking for answers. However, tables may also contain long strings that are possible answer candidates, so rows containing strings are retained. | None |
| valid_languages | Optional[List[str]] | Validates the language of the extracted text against a list of languages given in ISO 639-1 format (https://en.wikipedia.org/wiki/ISO_639-1). This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has probably produced garbled text. | None |
| encoding | Optional[str] | Select the file encoding (default is `utf-8`). | 'utf-8' |
Returns:
| Type | Description |
|---|---|
| List[Dict[str, Any]] | List of dicts of the format {"text": "The text from file", "meta": meta} |
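The return shape documented above can be made concrete with a small stand-in for a text converter. The sketch below reads a file with the given encoding and wraps the result in the documented `{"text": ..., "meta": ...}` structure. The function `convert_text_file` is hypothetical and omits the optional preprocessing steps (`remove_numeric_tables`, `valid_languages`); it shows only the input and output contract, not the `TextConverter` implementation.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


def convert_text_file(
    file_path: Path,
    meta: Optional[Dict[str, str]] = None,
    encoding: str = "utf-8",
) -> List[Dict[str, Any]]:
    """Read a text file and return it in the documented list-of-dicts shape."""
    text = Path(file_path).read_text(encoding=encoding)
    # One document per file; the caller's metadata is attached unchanged
    return [{"text": text, "meta": meta}]


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("Grüße aus Berlin", encoding="utf-8")
    documents = convert_text_file(sample_path, meta={"author": "demo"})

print(documents)
```

Passing the wrong `encoding` for a file either raises a decoding error or silently yields garbled characters, which is the situation the `valid_languages` check is meant to catch.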