Retriever Module¶
pipelines.pipelines.nodes.retriever.dense ¶
DensePassageRetriever ¶
Retriever that uses a bi-encoder (one transformer for query, one transformer for passage).
Source code in pipelines/pipelines/nodes/retriever/dense.py
__init__ ¶
__init__(document_store: BaseDocumentStore, query_embedding_model: Union[Path, str] = 'rocketqa-zh-dureader-query-encoder', passage_embedding_model: Union[Path, str] = 'rocketqa-zh-dureader-para-encoder', params_path: Optional[str] = '', model_version: Optional[str] = None, output_emb_size: Optional[int] = None, reinitialize: bool = False, share_parameters: bool = False, max_seq_len_query: int = 64, max_seq_len_passage: int = 384, top_k: int = 10, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, similarity_function: str = 'dot_product', progress_bar: bool = True, mode: Literal['snippets', 'raw_documents', 'preprocessed_documents'] = 'preprocessed_documents', **kwargs)
Initialize the retriever, including the two encoder models, from a local or remote model checkpoint.
Example:
```python
from pipelines.nodes import DensePassageRetriever

# your_doc_store: an already-initialized document store instance
# Remote model names (the default RocketQA encoders)
DensePassageRetriever(document_store=your_doc_store,
                      query_embedding_model="rocketqa-zh-dureader-query-encoder",
                      passage_embedding_model="rocketqa-zh-dureader-para-encoder")
# Or load both encoders from a local path
DensePassageRetriever(document_store=your_doc_store,
                      query_embedding_model="model_directory/question-encoder",
                      passage_embedding_model="model_directory/context-encoder")
```
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `document_store` | `BaseDocumentStore` | An instance of DocumentStore from which to retrieve documents. | required |
| `query_embedding_model` | `Union[Path, str]` | Local path or remote name of the question encoder checkpoint. The format equals the one used by PaddleNLP transformers' models. Currently available remote names: | `'rocketqa-zh-dureader-query-encoder'` |
| `passage_embedding_model` | `Union[Path, str]` | Local path or remote name of the passage encoder checkpoint. The format equals the one used by PaddleNLP transformers' models. Currently available remote names: | `'rocketqa-zh-dureader-para-encoder'` |
| `max_seq_len_query` | `int` | Maximum number of tokens for the query text. Longer sequences are truncated. | `64` |
| `max_seq_len_passage` | `int` | Maximum number of tokens for the passage/context text. Longer sequences are truncated. | `384` |
| `top_k` | `int` | How many documents to return per query. | `10` |
| `use_gpu` | `bool` | Whether to use all available GPUs or the CPU. Falls back on CPU if no GPU is available. | `True` |
| `batch_size` | `int` | Number of questions or passages to encode at once. In case of multiple GPUs, this will be the total batch size. | `16` |
| `embed_title` | `bool` | Whether to concatenate title and passage to a text pair that is then used to create the embedding. This is the approach used in the original paper and is likely to improve performance if your titles contain meaningful information for retrieval (topic, entities, etc.). The title is expected to be present in `doc.meta["name"]` and can be supplied in the documents before writing them to the DocumentStore like this: `{"text": "my text", "meta": {"name": "my title"}}`. | `True` |
| `similarity_function` | `str` | Which function to apply for calculating the similarity of query and passage embeddings during training. Options: | `'dot_product'` |
| `progress_bar` | `bool` | Whether to show a tqdm progress bar or not. Can be helpful to disable in production deployments to keep the logs clean. | `True` |
Source code in pipelines/pipelines/nodes/retriever/dense.py
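The `embed_title` option expects each document's title under `meta["name"]`. A minimal sketch of preparing such dictionaries before writing them to a document store is shown here; the helper name `to_document_dicts` and the sample texts are illustrative assumptions, not part of the library.

```python
from typing import Dict, List


def to_document_dicts(titles: List[str], texts: List[str]) -> List[Dict]:
    """Pair each passage with its title under meta["name"], the key embed_title reads."""
    if len(titles) != len(texts):
        raise ValueError("titles and texts must have the same length")
    return [{"text": text, "meta": {"name": title}} for title, text in zip(titles, texts)]


# Hypothetical sample data for illustration only.
docs = to_document_dicts(
    ["Great Wall", "Yangtze River"],
    ["The Great Wall is a series of fortifications.", "The Yangtze is the longest river in Asia."],
)
```

Documents without a meaningful title gain little from this pairing, in which case `embed_title=False` may be the simpler choice.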
embed_documents ¶
Create embeddings for a list of documents using the passage encoder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `docs` | `List[Document]` | List of Document objects used to represent documents / passages in a standardized way within pipelines. | required |
Returns:
| Type | Description |
|---|---|
| `List[ndarray]` | Embeddings of documents / passages, with shape (batch_size, embedding_dim). |
Source code in pipelines/pipelines/nodes/retriever/dense.py
embed_queries ¶
Create embeddings for a list of queries using the query encoder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `List[str]` | Queries to embed. | required |
Returns:
| Type | Description |
|---|---|
| `List[ndarray]` | Embeddings, one per input query. |
Source code in pipelines/pipelines/nodes/retriever/dense.py
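To make the roles of the two encoders concrete, the following sketch ranks passage embeddings against a query embedding by dot product (the default `similarity_function`) and keeps the `top_k` best. It uses plain Python lists in place of the arrays returned by `embed_queries` and `embed_documents`; the function names and toy vectors are assumptions for illustration, and a real document store would perform this search itself.

```python
import heapq
from typing import List, Sequence, Tuple


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(
    query_emb: Sequence[float],
    passage_embs: Sequence[Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the top_k highest-scoring passages."""
    scored = [(i, dot_product(query_emb, p)) for i, p in enumerate(passage_embs)]
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


# Toy 2-dimensional embeddings standing in for encoder output.
query = [1.0, 0.0]
passages = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
ranked = top_k_by_dot_product(query, passages, top_k=2)
```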
retrieve ¶
retrieve(query: str, query_type: Optional[ContentTypes] = None, filters: dict = None, top_k: Optional[int] = None, index: str = None, headers: Optional[Dict[str, str]] = None, **kwargs) -> List[Document]
Scan through documents in DocumentStore and return a small number of documents that are most relevant to the query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The query. | required |
| `filters` | `dict` | A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field. | `None` |
| `top_k` | `Optional[int]` | How many documents to return per query. | `None` |
| `index` | `str` | The name of the index in the DocumentStore from which to retrieve documents. | `None` |
Source code in pipelines/pipelines/nodes/retriever/dense.py
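The `filters` argument maps a metadata field to a list of accepted values. The sketch below shows how such a dictionary could be evaluated against one document's metadata. It assumes that multiple fields are combined with a logical AND, which this reference does not state, and `matches_filters` is a hypothetical helper rather than a library function.

```python
from typing import Any, Dict, List


def matches_filters(meta: Dict[str, Any], filters: Dict[str, List[Any]]) -> bool:
    """True if, for every filtered field, the document's value is among the accepted values."""
    return all(meta.get(field) in accepted for field, accepted in filters.items())


# Hypothetical metadata fields for illustration only.
filters = {"category": ["news", "blog"], "year": [2023]}
kept = matches_filters({"category": "news", "year": 2023}, filters)
dropped = matches_filters({"category": "news", "year": 2021}, filters)
```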
DenseRetriever ¶
Base class for all dense retrievers.
Source code in pipelines/pipelines/nodes/retriever/dense.py
embed_documents (abstractmethod) ¶
Create embeddings for a list of documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `documents` | `List[Document]` | List of documents to embed. | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | Embeddings of documents, one per input document, shape: (documents, embedding_dim) |
Source code in pipelines/pipelines/nodes/retriever/dense.py
embed_queries (abstractmethod) ¶
Create embeddings for a list of queries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `queries` | `List[str]` | List of queries to embed. | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | Embeddings, one per input query, shape: (queries, embedding_dim) |
Source code in pipelines/pipelines/nodes/retriever/dense.py
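The contract of the two abstract methods can be illustrated with a toy stand-in. The classes below are not the library's `DenseRetriever`; they are a hypothetical sketch that uses plain strings instead of `Document` objects and nested lists instead of `ndarray`, only to show that a concrete retriever must implement both methods and return one vector per input.

```python
from abc import ABC, abstractmethod
from typing import List


class ToyDenseRetriever(ABC):
    """Hypothetical base class mirroring the two abstract embedding methods."""

    @abstractmethod
    def embed_queries(self, queries: List[str]) -> List[List[float]]:
        """Return one embedding per query."""

    @abstractmethod
    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        """Return one embedding per document."""


class CharCountRetriever(ToyDenseRetriever):
    """Toy implementation: embeds text as [length, number of spaces]."""

    def _embed(self, text: str) -> List[float]:
        return [float(len(text)), float(text.count(" "))]

    def embed_queries(self, queries: List[str]) -> List[List[float]]:
        return [self._embed(q) for q in queries]

    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        return [self._embed(d) for d in documents]


query_vectors = CharCountRetriever().embed_queries(["a b", "hello"])
```

Instantiating `ToyDenseRetriever` directly raises `TypeError`, which is the same safeguard an abstract base class gives any subclass that forgets one of the two methods.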
EmbeddingRetriever ¶
Retriever that uses a bi-encoder (query model for query, passage model for passage).
Source code in pipelines/pipelines/nodes/retriever/dense.py
__init__ ¶
__init__(document_store: BaseDocumentStore, embedding_model: Union[Path, str] = 'ernie-embedding-v1', max_seq_len: int = 384, top_k: int = 10, batch_size: int = 16, embed_title: bool = True, similarity_function: str = 'dot_product', api_key: Optional[str] = None, secret_key: Optional[str] = None, scale_score: bool = True, progress_bar: bool = True, embed_meta_fields: Optional[List[str]] = None, mode: Literal['snippets', 'raw_documents', 'preprocessed_documents'] = 'preprocessed_documents', **kwargs)
Initialize the retriever, including its embedding model, from a local or remote model checkpoint.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `document_store` | `BaseDocumentStore` | An instance of DocumentStore from which to retrieve documents. | required |
| `embedding_model` | `Union[Path, str]` | Local path or remote name of the embedding model checkpoint. The format equals the one used by PaddleNLP transformers' models. Currently available remote names: | `'ernie-embedding-v1'` |
| `top_k` | `int` | How many documents to return per query. | `10` |
| `batch_size` | `int` | Number of questions or passages to encode at once. In case of multiple GPUs, this will be the total batch size. | `16` |
| `embed_title` | `bool` | Whether to concatenate title and passage to a text pair that is then used to create the embedding. This is the approach used in the original paper and is likely to improve performance if your titles contain meaningful information for retrieval (topic, entities, etc.). The title is expected to be present in `doc.meta["name"]` and can be supplied in the documents before writing them to the DocumentStore like this: `{"text": "my text", "meta": {"name": "my title"}}`. | `True` |
| `similarity_function` | `str` | Which function to apply for calculating the similarity of query and passage embeddings during training. Options: | `'dot_product'` |
| `progress_bar` | `bool` | Whether to show a tqdm progress bar or not. Can be helpful to disable in production deployments to keep the logs clean. | `True` |
Source code in pipelines/pipelines/nodes/retriever/dense.py
embed_documents ¶
Create embeddings for a list of documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `documents` | `List[Document]` | List of documents to embed. | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | Embeddings, one per input document, shape: (docs, embedding_dim) |
Source code in pipelines/pipelines/nodes/retriever/dense.py
embed_queries ¶
Create embeddings for a list of queries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `queries` | `List[str]` | List of queries to embed. | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | Embeddings, one per input query, shape: (queries, embedding_dim) |
Source code in pipelines/pipelines/nodes/retriever/dense.py
retrieve ¶
retrieve(query: str, filters: Optional[FilterType] = None, top_k: Optional[int] = None, index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, scale_score: Optional[bool] = None, document_store: Optional[BaseDocumentStore] = None) -> List[Document]
Scan through the documents in a DocumentStore and return a small number of documents that are most relevant to the query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The query. | required |
| `filters` | `Optional[FilterType]` | Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ( | `None` |
| `top_k` | `Optional[int]` | How many documents to return per query. | `None` |
| `index` | `Optional[str]` | The name of the index in the DocumentStore from which to retrieve documents. | `None` |
| `headers` | `Optional[Dict[str, str]]` | Custom HTTP headers to pass to the document store client if supported (e.g. `{'Authorization': 'Basic API_KEY'}` for basic authentication). | `None` |
| `scale_score` | `Optional[bool]` | Whether to scale the similarity score to the unit interval (range of [0,1]). If true, similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores are used. | `None` |
| `document_store` | `Optional[BaseDocumentStore]` | The docstore to use for retrieval. If | `None` |
Source code in pipelines/pipelines/nodes/retriever/dense.py
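The `scale_score` option maps raw similarity scores into [0,1], but the exact mapping is not given in this reference. The sketch below shows one plausible convention, stated as an assumption: a linear shift for cosine similarity and a logistic squash of `score / 100` for dot products. The function name and the divisor 100 are illustrative and may differ from what the library does.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score into [0, 1] (assumed convention, see lead-in)."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it.
        return (score + 1.0) / 2.0
    if similarity == "dot_product":
        # Dot products are unbounded; squash with a numerically stable logistic.
        x = score / 100.0
        if x >= 0:
            return 1.0 / (1.0 + math.exp(-x))
        return math.exp(x) / (1.0 + math.exp(x))
    raise ValueError(f"unsupported similarity function: {similarity!r}")


scaled_cosine = scale_to_unit_interval(0.5, "cosine")
scaled_dot = scale_to_unit_interval(0.0, "dot_product")
```

Scaled scores are convenient for thresholds that should work across similarity functions, while raw scores preserve the original magnitudes.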
retrieve_batch ¶
retrieve_batch(queries: List[str], filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None, top_k: Optional[int] = None, index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, batch_size: Optional[int] = None, scale_score: Optional[bool] = None, document_store: Optional[BaseDocumentStore] = None) -> List[List[Document]]
Scan through the documents in a DocumentStore and return a small number of documents that are most relevant to the supplied queries.
Returns a list of lists of Documents (one per query).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `queries` | `List[str]` | List of query strings. | required |
| `filters` | `Optional[Union[FilterType, List[Optional[FilterType]]]]` | Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Can be a single filter that will be applied to each query or a list of filters (one filter per query). Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ( | `None` |
| `top_k` | `Optional[int]` | How many documents to return per query. | `None` |
| `index` | `Optional[str]` | The name of the index in the DocumentStore from which to retrieve documents. | `None` |
| `headers` | `Optional[Dict[str, str]]` | Custom HTTP headers to pass to the document store client if supported (e.g. `{'Authorization': 'Basic API_KEY'}` for basic authentication). | `None` |
| `batch_size` | `Optional[int]` | Number of queries to embed at a time. | `None` |
| `scale_score` | `Optional[bool]` | Whether to scale the similarity score to the unit interval (range of [0,1]). If true, similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores are used. | `None` |
| `document_store` | `Optional[BaseDocumentStore]` | The docstore to use for retrieval. If | `None` |
Source code in pipelines/pipelines/nodes/retriever/dense.py
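Two behaviors described for `retrieve_batch` can be sketched in isolation: a single filter is applied to every query while a list supplies one filter per query, and queries are embedded `batch_size` at a time. The helpers `expand_filters` and `batched` and the `FilterType` alias below are hypothetical stand-ins, not the library's internals.

```python
from typing import Any, Dict, Iterator, List, Optional, Union

FilterType = Dict[str, Any]  # stand-in alias for illustration


def expand_filters(
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]],
    n_queries: int,
) -> List[Optional[FilterType]]:
    """Return exactly one filter (or None) per query."""
    if isinstance(filters, list):
        if len(filters) != n_queries:
            raise ValueError("a list of filters needs one entry per query")
        return filters
    return [filters] * n_queries


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be > 0")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


queries = ["q1", "q2", "q3"]
per_query_filters = expand_filters({"lang": ["zh"]}, len(queries))
query_batches = list(batched(queries, batch_size=2))
```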
pipelines.pipelines.nodes.retriever.embedder ¶
MultiModalEmbedder ¶
Source code in pipelines/pipelines/nodes/retriever/embedder.py
__init__ ¶
__init__(embedding_models: Dict[str, Union[Path, str]], feature_extractors_params: Optional[Dict[str, Dict[str, Any]]] = None, batch_size: int = 16, embed_meta_fields: List[str] = ['name'], progress_bar: bool = True)
Init the Retriever and all its models from a local or remote model checkpoint.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `embedding_models` | `Dict[str, Union[Path, str]]` | A dictionary matching a local path or remote name of encoder checkpoint with the content type it should handle ("text", "image", etc.). Expected input format: | required |
| `feature_extractors_params` | `Optional[Dict[str, Dict[str, Any]]]` | A dictionary matching a content type ("text", "image" and so on) with the parameters of its own feature extractor if the model requires one. Expected input format: | `None` |
| `batch_size` | `int` | Number of questions or passages to encode at once. In case of multiple GPUs, this will be the total batch size. | `16` |
| `embed_meta_fields` | `List[str]` | Concatenate the provided meta fields and text passage / image to a text pair that is then used to create the embedding. This is the approach used in the original paper and is likely to improve performance if your titles contain meaningful information for retrieval (topic, entities, etc.). | `['name']` |
| `progress_bar` | `bool` | Whether to show a tqdm progress bar or not. Can be helpful to disable in production deployments to keep the logs clean. | `True` |
Source code in pipelines/pipelines/nodes/retriever/embedder.py
embed ¶
Create embeddings for a list of documents using the relevant encoder for their content type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `documents` | `List[Document]` | Documents to embed. | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | Embeddings, one per document, in the form of a np.array |
Source code in pipelines/pipelines/nodes/retriever/embedder.py
pipelines.pipelines.nodes.retriever.ernie_encoder ¶
ErnieEmbeddingEncoder ¶
Source code in pipelines/pipelines/nodes/retriever/ernie_encoder.py
pipelines.pipelines.nodes.retriever.multimodal_retriever ¶
MultiModalRetriever ¶
Source code in pipelines/pipelines/nodes/retriever/multimodal_retriever.py
__init__ ¶
__init__(document_store: BaseDocumentStore, query_embedding_model: Union[Path, str], document_embedding_models: Dict[str, Union[Path, str]], query_type: str = 'text', query_feature_extractor_params: Dict[str, Any] = {'max_length': 64}, document_feature_extractors_params: Dict[str, Dict[str, Any]] = {'text': {'max_length': 256}}, top_k: int = 10, batch_size: int = 16, embed_meta_fields: List[str] = ['name'], similarity_function: str = 'dot_product', progress_bar: bool = True, scale_score: bool = True)
Retriever that uses multiple encoders to jointly retrieve from a database consisting of different data types.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `document_store` | `BaseDocumentStore` | An instance of DocumentStore from which to retrieve documents. | required |
| `query_embedding_model` | `Union[Path, str]` | Local path or remote name of question encoder checkpoint. The format equals the one used by Hugging Face transformers' modelhub models. | required |
| `document_embedding_models` | `Dict[str, Union[Path, str]]` | Dictionary matching a local path or remote name of document encoder checkpoint with the content type it should handle ("text", "table", "image", and so on). The format equals the one used by Hugging Face transformers' modelhub models. | required |
| `query_type` | `str` | The content type of the query ("text", "image" and so on). | `'text'` |
| `query_feature_extractor_params` | `Dict[str, Any]` | The parameters to pass to the feature extractor of the query. | `{'max_length': 64}` |
| `document_feature_extractors_params` | `Dict[str, Dict[str, Any]]` | The parameters to pass to the feature extractor of the documents. | `{'text': {'max_length': 256}}` |
| `top_k` | `int` | How many documents to return per query. | `10` |
| `batch_size` | `int` | Number of questions or documents to encode at once. For multiple GPUs, this is the total batch size. | `16` |
| `embed_meta_fields` | `List[str]` | Concatenate the provided meta fields to a (text) pair that is then used to create the embedding. This is likely to improve performance if your titles contain meaningful information for retrieval (topic, entities, and so on). Note that only text and table documents support this feature. | `['name']` |
| `similarity_function` | `str` | Which function to apply for calculating the similarity of query and document embeddings during training. Options: | `'dot_product'` |
| `progress_bar` | `bool` | Whether to show a tqdm progress bar or not. Can be helpful to disable in production deployments to keep the logs clean. | `True` |
| `scale_score` | `bool` | Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (for example, cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores are used. | `True` |
Source code in pipelines/pipelines/nodes/retriever/multimodal_retriever.py
retrieve ¶
retrieve(query: Any, query_type: Optional[ContentTypes] = None, filters: Optional[FilterType] = None, top_k: Optional[int] = None, index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, scale_score: Optional[bool] = None, document_store: Optional[BaseDocumentStore] = None) -> List[Document]
Scan through documents in DocumentStore and return a small number of documents that are most relevant to the supplied query. Returns a list of Documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Any` | Query value. It might be text, a path, a table, and so on. | required |
| `query_type` | `Optional[ContentTypes]` | Type of the query ("text", "table", "image" and so on). | `None` |
| `filters` | `Optional[FilterType]` | Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. It can be a single filter applied to each query or a list of filters (one filter per query). | `None` |
| `top_k` | `Optional[int]` | How many documents to return per query. Must be > 0. | `None` |
| `index` | `Optional[str]` | The name of the index in the DocumentStore from which to retrieve documents. | `None` |
| `batch_size` | | Number of queries to embed at a time. Must be > 0. | required |
| `scale_score` | `Optional[bool]` | Whether to scale the similarity score to the unit interval (range of [0,1]). If true, similarity scores (for example, cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores are used. | `None` |
Source code in pipelines/pipelines/nodes/retriever/multimodal_retriever.py
retrieve_batch ¶
retrieve_batch(queries: List[Any], queries_type: Optional[ContentTypes] = None, filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None, top_k: Optional[int] = None, index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, batch_size: Optional[int] = None, scale_score: Optional[bool] = None, document_store: Optional[BaseDocumentStore] = None) -> List[List[Document]]
Scan through documents in DocumentStore and return a small number of documents that are most relevant to the supplied queries. Returns a list of lists of Documents (one list per query).
This method assumes all queries are of the same data type. Mixed-type query batches (for example one image and one text) are currently not supported. Group the queries by type and call retrieve() on uniform batches only.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `queries` | `List[Any]` | List of query values. They might be text, paths, tables, and so on. | required |
| `queries_type` | `Optional[ContentTypes]` | Type of the query ("text", "table", "image" and so on). | `None` |
| `filters` | `Optional[Union[FilterType, List[Optional[FilterType]]]]` | Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. It can be a single filter that will be applied to each query or a list of filters (one filter per query). | `None` |
| `top_k` | `Optional[int]` | How many documents to return per query. Must be > 0. | `None` |
| `index` | `Optional[str]` | The name of the index in the DocumentStore from which to retrieve documents. | `None` |
| `batch_size` | `Optional[int]` | Number of queries to embed at a time. Must be > 0. | `None` |
| `scale_score` | `Optional[bool]` | Whether to scale the similarity score to the unit interval (range of [0,1]). If True, similarity scores (for example, cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores are used. | `None` |
Source code in pipelines/pipelines/nodes/retriever/multimodal_retriever.py
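Since mixed-type batches are not supported, callers need to split queries by type and merge the results back into input order. The following sketch shows that grouping step. `retrieve_mixed` is a hypothetical helper, and `retrieve_uniform` stands in for a call on one uniform batch; neither is part of the library, and results are plain strings rather than `Document` objects.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple


def retrieve_mixed(
    queries: List[Tuple[str, Any]],
    retrieve_uniform: Callable[[List[Any], str], List[List[str]]],
) -> List[List[str]]:
    """Group (query_type, query) pairs by type, retrieve per group, restore input order."""
    by_type: Dict[str, List[Tuple[int, Any]]] = defaultdict(list)
    for position, (query_type, query) in enumerate(queries):
        by_type[query_type].append((position, query))

    results: List[List[str]] = [[] for _ in queries]
    for query_type, items in by_type.items():
        batch_results = retrieve_uniform([query for _, query in items], query_type)
        for (position, _), docs in zip(items, batch_results):
            results[position] = docs
    return results


calls: List[Tuple[str, int]] = []


def fake_retrieve_uniform(batch: List[Any], query_type: str) -> List[List[str]]:
    """Toy stand-in that records each uniform call and echoes the query."""
    calls.append((query_type, len(batch)))
    return [[f"{query_type}:{query}"] for query in batch]


mixed = [("text", "red shoes"), ("image", "shoe.png"), ("text", "blue hat")]
merged = retrieve_mixed(mixed, fake_retrieve_uniform)
```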
pipelines.pipelines.nodes.retriever.parallel_retriever ¶
ParallelRetriever ¶
Source code in pipelines/pipelines/nodes/retriever/parallel_retriever.py
__init__ ¶
__init__(document_store: BaseDocumentStore, model_version: Optional[str] = None, output_emb_size: Optional[int] = None, reinitialize: bool = False, share_parameters: bool = False, max_seq_len_query: int = 64, max_seq_len_passage: int = 384, top_k: int = 10, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, similarity_function: str = 'dot_product', progress_bar: bool = True, mode: Literal['snippets', 'raw_documents', 'preprocessed_documents'] = 'preprocessed_documents', url='0.0.0.0:8082', num_process=10, **kwargs)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `url` | | The address (host and port) of the HTTP service. | `'0.0.0.0:8082'` |
| `num_process` | | The number of processes. | `10` |
Source code in pipelines/pipelines/nodes/retriever/parallel_retriever.py
retrieve ¶
retrieve(query: str, query_type: Optional[ContentTypes] = None, filters: dict = None, top_k: Optional[int] = None, index: str = None, headers: Optional[Dict[str, str]] = None, **kwargs) -> List[Document]
Scan through documents in DocumentStore and return a small number of documents that are most relevant to the query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The query. | required |
| `filters` | `dict` | A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field. | `None` |
| `top_k` | `Optional[int]` | How many documents to return per query. | `None` |
| `index` | `str` | The name of the index in the DocumentStore from which to retrieve documents. | `None` |
Source code in pipelines/pipelines/nodes/retriever/parallel_retriever.py
TritonRunner ¶
Source code in pipelines/pipelines/nodes/retriever/parallel_retriever.py
Run_query ¶
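Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `list` | Each value corresponds to an input name of `self._input_names`. | required |
Returns:
| Type | Description |
|---|---|
| `dict` | Results in the form `{name: numpy.array}`. |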
Source code in pipelines/pipelines/nodes/retriever/parallel_retriever.py
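The positional correspondence between `inputs` and the runner's input names can be sketched as a simple pairing. `bind_inputs` is a hypothetical helper, and the input names `input_ids` and `token_type_ids` are assumptions for illustration; the real names come from the served model.

```python
from typing import Any, Dict, List, Sequence


def bind_inputs(input_names: Sequence[str], inputs: List[Any]) -> Dict[str, Any]:
    """Pair each positional value with the input name at the same index."""
    if len(inputs) != len(input_names):
        raise ValueError(
            f"expected {len(input_names)} inputs, got {len(inputs)}"
        )
    return dict(zip(input_names, inputs))


# Hypothetical input names and token ids for a text encoder.
bound = bind_inputs(["input_ids", "token_type_ids"], [[[1, 2, 3]], [[0, 0, 0]]])
```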
__init__ ¶
__init__(server_url: str, model_name: str, model_version: str, verbose=False, resp_wait_s: Optional[float] = None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `server_url` | `str` | The address of the server. | required |
| `model_name` | `str` | The model name; it needs to match the name in config.txt. | required |
| `model_version` | `str` | Model version number. | required |
| `resp_wait_s` | `Optional[float]` | The response waiting time. | `None` |
Source code in pipelines/pipelines/nodes/retriever/parallel_retriever.py
pipelines.pipelines.nodes.retriever.sparse ¶
BM25Retriever ¶
Source code in pipelines/pipelines/nodes/retriever/sparse.py
__init__ ¶
__init__(document_store: Optional[KeywordDocumentStore] = None, top_k: int = 10, all_terms_must_match: bool = False, custom_query: Optional[str] = None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `document_store` | `Optional[KeywordDocumentStore]` | An instance of one of the following DocumentStores to retrieve from: ElasticsearchDocumentStore, OpenSearchDocumentStore, or OpenDistroElasticsearchDocumentStore. If None, a document store must be passed to the retrieve method for this Retriever to work. | `None` |
| `all_terms_must_match` | `bool` | Whether all terms of the query must match the document. If True, all query terms must be present in a document for it to be retrieved (i.e., the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise, at least one query term must be present in a document for it to be retrieved (i.e., the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to False. | `False` |
| `custom_query` | `Optional[str]` | Query string as per Elasticsearch DSL with a mandatory query placeholder(query). Optionally, ES | `None` |
| `top_k` | `int` | How many documents to return per query. | `10` |
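The effect of `all_terms_must_match` can be pictured with a small, self-contained sketch. The `build_boolean_query` and `matches` helpers below are hypothetical and only mimic the implicit AND/OR behaviour on whitespace-separated terms; the actual matching is done by the document store, not by code like this.

```python
def build_boolean_query(query: str, all_terms_must_match: bool = False) -> str:
    """Join whitespace-separated terms with AND or OR (illustrative only)."""
    operator = " AND " if all_terms_must_match else " OR "
    return operator.join(query.split())


def matches(document: str, query: str, all_terms_must_match: bool = False) -> bool:
    """Check whether a document satisfies the implicit boolean query."""
    doc_terms = set(document.lower().split())
    hits = [term in doc_terms for term in query.lower().split()]
    return all(hits) if all_terms_must_match else any(hits)


query = "cozy fish restaurant"
document = "a cozy restaurant by the sea"

print(build_boolean_query(query, all_terms_must_match=True))  # cozy AND fish AND restaurant
print(build_boolean_query(query))                             # cozy OR fish OR restaurant
print(matches(document, query, all_terms_must_match=True))    # False: "fish" is missing
print(matches(document, query))                               # True: "cozy" is present
```

Setting the flag to True trades recall for precision: fewer documents qualify, but each one contains every query term.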
Source code in pipelines/pipelines/nodes/retriever/sparse.py
retrieve ¶
retrieve(query: str, query_type: ContentTypes = 'text', filters: Optional[Dict[str, Union[Dict, List, str, int, float, bool]]] = None, top_k: Optional[int] = None, index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, document_store: Optional[BaseDocumentStore] = None, **kwargs) -> List[Document]
Scan through documents in DocumentStore and return a small number of documents that are most relevant to the query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The query | required |
| `filters` | `Optional[Dict[str, Union[Dict, List, str, int, float, bool]]]` | Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ( | `None` |
| `top_k` | `Optional[int]` | How many documents to return per query. | `None` |
| `index` | `Optional[str]` | The name of the index in the DocumentStore from which to retrieve documents | `None` |
| `headers` | `Optional[Dict[str, str]]` | Custom HTTP headers to pass to elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information. | `None` |
| `document_store` | `Optional[BaseDocumentStore]` | the docstore to use for retrieval. If | `None` |
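The example `headers` value in the table is a standard HTTP Basic authentication header. The sketch below shows one way such a dictionary could be built with the Python standard library; `basic_auth_headers` is a hypothetical helper, and the `admin`/`root` credentials are simply what the documented example value decodes to.

```python
import base64
from typing import Dict


def basic_auth_headers(username: str, password: str) -> Dict[str, str]:
    """Build an HTTP Basic auth header dict (hypothetical helper)."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


headers = basic_auth_headers("admin", "root")
print(headers)  # {'Authorization': 'Basic YWRtaW46cm9vdA=='}
```

The resulting dictionary has the shape expected by the `headers` parameter. Base64 is an encoding, not encryption, so such headers should only travel over a secured connection.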
Source code in pipelines/pipelines/nodes/retriever/sparse.py
retrieve_batch ¶
retrieve_batch(queries: List[str], queries_type: ContentTypes = 'text', filters: Optional[Union[Dict[str, Union[Dict, List, str, int, float, bool]], List[Dict[str, Union[Dict, List, str, int, float, bool]]]]] = None, top_k: Optional[int] = None, index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, batch_size: Optional[int] = None, document_store: Optional[BaseDocumentStore] = None) -> List[List[Document]]
Scan through documents in DocumentStore and return a small number of documents that are most relevant to the supplied queries. Returns a list of lists of Documents (one list per query).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `queries` | `List[str]` | List of query strings. | required |
| `filters` | `Optional[Union[Dict[str, Union[Dict, List, str, int, float, bool]], List[Dict[str, Union[Dict, List, str, int, float, bool]]]]]` | Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ( | `None` |
| `top_k` | `Optional[int]` | How many documents to return per query. | `None` |
| `index` | `Optional[str]` | The name of the index in the DocumentStore from which to retrieve documents | `None` |
| `headers` | `Optional[Dict[str, str]]` | Custom HTTP headers to pass to elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information. | `None` |
| `batch_size` | `Optional[int]` | Not applicable. | `None` |
| `document_store` | `Optional[BaseDocumentStore]` | the docstore to use for retrieval. If | `None` |
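To make the `List[List[Document]]` return shape concrete, here is a toy, in-memory sketch. It is not the library's implementation: `CORPUS`, `toy_retrieve`, and `toy_retrieve_batch` are hypothetical, documents are plain dictionaries, and ranking uses a simple term-overlap count rather than BM25.

```python
from typing import Dict, List

# Toy corpus standing in for a DocumentStore (hypothetical data).
CORPUS: List[Dict[str, str]] = [
    {"id": "1", "content": "cozy fish restaurant near the harbor"},
    {"id": "2", "content": "fast food burger place"},
    {"id": "3", "content": "fish market with fresh seafood"},
]


def toy_retrieve(query: str, top_k: int = 10) -> List[Dict[str, str]]:
    """Rank documents by how many query terms they contain (not BM25)."""
    terms = set(query.lower().split())
    scored = []
    for doc in CORPUS:
        overlap = len(terms & set(doc["content"].lower().split()))
        if overlap > 0:
            scored.append((overlap, doc))
    # Highest overlap first; the sort is stable, so ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def toy_retrieve_batch(queries: List[str], top_k: int = 10) -> List[List[Dict[str, str]]]:
    """Return one result list per query, in the same order as `queries`."""
    return [toy_retrieve(query, top_k=top_k) for query in queries]


results = toy_retrieve_batch(["fish restaurant", "burger", "pizza"], top_k=1)
print([[doc["id"] for doc in hits] for hits in results])  # [['1'], ['2'], []]
```

The outer list always has one entry per query, so a query with no matches yields an empty inner list rather than being dropped.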
Source code in pipelines/pipelines/nodes/retriever/sparse.py
pipelines.pipelines.nodes.retriever.web ¶
WebRetriever ¶
WebRetriever makes it possible to query the web for relevant documents. It downloads web page results returned by WebSearch, strips HTML, and extracts raw text, which is then split into smaller documents using the optional PreProcessor.
WebRetriever operates in three modes:
- snippets mode: WebRetriever returns a list of Documents. Each Document is a snippet of the search result.
- raw_documents mode: WebRetriever returns a list of Documents. Each Document is a full website returned by the search, stripped of HTML.
- preprocessed_documents mode: WebRetriever returns a list of Documents. Each Document is a preprocessed split of the full website stripped of HTML.
In the preprocessed_documents mode, after WebSearch receives the query through the run() method, it fetches the top_k URLs relevant to the query. WebSearch then downloads and processes these URLs. The processing involves stripping HTML tags and producing clean, raw text wrapped in Document objects. WebRetriever then splits the raw text into Documents according to the PreProcessor settings. Finally, WebRetriever returns the top_k preprocessed Documents.
Finding the right balance between top_k and top_p is crucial to obtain high-quality and diverse results in the document mode. To explore potential results, we recommend that you set top_k for WebSearch close to 10. However, keep in mind that setting a high top_k value results in fetching and processing numerous web pages and is heavier on the resources.
We recommend you use the default value for top_k and adjust it based on your specific use case. The default value is 5. This means WebRetriever returns at most five of the most relevant processed documents, ensuring the search results are diverse but still of high quality. To get more results, increase top_k.
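The difference between the three modes can be sketched with the standard library alone. Everything below is a simplified assumption for illustration: `strip_html`, `split_words`, and `to_documents` are hypothetical helpers, the search result is hard-coded instead of fetched, Documents are plain strings, and the fixed-size word split only stands in for whatever the configured PreProcessor does.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _TextExtractor(HTMLParser):
    """Collect text nodes and drop all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def split_words(text: str, split_length: int) -> List[str]:
    """Split text into chunks of at most `split_length` words."""
    words = text.split()
    return [" ".join(words[i:i + split_length]) for i in range(0, len(words), split_length)]


def to_documents(result: Dict[str, str], mode: str, split_length: int = 4) -> List[str]:
    """Turn one search result into documents according to the mode."""
    if mode == "snippets":
        return [result["snippet"]]
    text = strip_html(result["html"])
    if mode == "raw_documents":
        return [text]
    if mode == "preprocessed_documents":
        return split_words(text, split_length)
    raise ValueError(f"unknown mode: {mode}")


# Hard-coded stand-in for a single search hit.
result = {
    "snippet": "Paris is the capital of France.",
    "html": "<html><body><h1>Paris</h1><p>Paris is the capital and largest city of France.</p></body></html>",
}
for mode in ("snippets", "raw_documents", "preprocessed_documents"):
    print(mode, to_documents(result, mode))
```

In this sketch, snippets mode never touches the page body, which is why it is the cheapest option; the other two modes both require downloading and cleaning the page first.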
Source code in pipelines/pipelines/nodes/retriever/web.py
__init__ ¶
__init__(api_key: str, search_engine_provider: Union[str, SearchEngine] = 'SerpAPI', engine: Optional[str] = 'google', top_search_results: Optional[int] = 10, search_engine_kwargs: Optional[Dict[str, Any]] = None, top_k: Optional[int] = 5, mode: Literal['snippets', 'raw_documents', 'preprocessed_documents'] = 'snippets', preprocessor: Optional[PreProcessor] = None, cache_document_store: Optional[BaseDocumentStore] = None, cache_index: Optional[str] = None, cache_headers: Optional[Dict[str, str]] = None, cache_time: int = 1 * 24 * 60 * 60)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `top_k` | `Optional[int]` | Top k documents to be returned by the retriever. | `5` |
| `mode` | `Literal['snippets', 'raw_documents', 'preprocessed_documents']` | Whether to return snippets, raw documents, or preprocessed documents. Defaults to 'snippets'. | `'snippets'` |
| `preprocessor` | `Optional[PreProcessor]` | Optional PreProcessor to be used to split documents into paragraphs. If not provided, the default PreProcessor is used. | `None` |
| `cache_document_store` | `Optional[BaseDocumentStore]` | DocumentStore to be used to cache search results. | `None` |
| `cache_index` | `Optional[str]` | Index name to be used to cache search results. | `None` |
| `cache_headers` | `Optional[Dict[str, str]]` | Headers to be used to cache search results. | `None` |
| `cache_time` | `int` | Time in seconds to cache search results. Defaults to 24 hours. | `1 * 24 * 60 * 60` |
Source code in pipelines/pipelines/nodes/retriever/web.py
retrieve ¶
retrieve(query: str, top_k: Optional[int] = None, preprocessor: Optional[PreProcessor] = None, cache_document_store: Optional[BaseDocumentStore] = None, cache_index: Optional[str] = None, cache_headers: Optional[Dict[str, str]] = None, cache_time: Optional[int] = None, **kwargs) -> List[Document]
Retrieve documents based on the list of URLs from the WebSearchEngine. The documents are scraped from the web in real time. You can store them in a DocumentStore for later use, or cache them in a DocumentStore to improve retrieval time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The query string. | required |
| `top_k` | `Optional[int]` | The number of documents to be returned by the retriever. If None, the default value is used. | `None` |
| `preprocessor` | `Optional[PreProcessor]` | The PreProcessor to be used to split documents into paragraphs. | `None` |
| `cache_document_store` | `Optional[BaseDocumentStore]` | The DocumentStore to cache the documents to. | `None` |
| `cache_index` | `Optional[str]` | The index name to save the documents to. | `None` |
| `cache_headers` | `Optional[Dict[str, str]]` | The headers to save the documents to. | `None` |
| `cache_time` | `Optional[int]` | The time limit in seconds to check the cache. The default is 24 hours. | `None` |
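The `cache_time` default of `1 * 24 * 60 * 60` evaluates to 86400 seconds. How the library compares timestamps internally is not shown here, so the following is only a sketch of the idea under that assumption; `is_cache_fresh` and `DEFAULT_CACHE_TIME` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_CACHE_TIME = 1 * 24 * 60 * 60  # 86400 seconds, i.e. 24 hours


def is_cache_fresh(cached_at: datetime, now: datetime, cache_time: Optional[int] = None) -> bool:
    """Return True if a cached entry is younger than `cache_time` seconds."""
    limit = DEFAULT_CACHE_TIME if cache_time is None else cache_time
    return (now - cached_at) < timedelta(seconds=limit)


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(is_cache_fresh(now - timedelta(hours=23), now))                  # True
print(is_cache_fresh(now - timedelta(hours=25), now))                  # False
print(is_cache_fresh(now - timedelta(hours=2), now, cache_time=3600))  # False
```

Passing `None` falls back to the default, mirroring how the optional `cache_time` argument of `retrieve` defers to the value set at construction time.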
Source code in pipelines/pipelines/nodes/retriever/web.py