Update various docs (#4)
* rename cli tool
* remove redundant docs
* update docs
* update macos instructions
* add badges
32
docs/development/data-components.md
Normal file
@@ -0,0 +1,32 @@
The data & data structure components include:

- The `Document` class.
- The document store.
- The vector store.

### Data Loader

- PdfLoader
- Layout-aware PdfLoader with table parsing

  - MathPixLoader: To use this loader, you need a MathPix API key. Refer to the [MathPix docs](https://docs.mathpix.com/#introduction) for more information.
  - OCRLoader: This loader uses lib-table and a Flax pipeline to perform OCR and read table structure from PDF files (TODO: add more info about deployment of this module).
  - Output:

    - Document: text plus metadata that identifies whether the content is a table

    ```
    - "source": source file name
    - "type": "table" or "text"
    - "table_origin": original table in markdown format (to be fed to an LLM or visualized using external tools)
    - "page_label": page number in the original PDF document
    ```
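
The metadata keys listed above can be used to route loader output, for example to send tables and plain text down different paths. The following is a minimal sketch only: the `Document` dataclass and the `split_by_type` helper are hypothetical stand-ins written for illustration and are not this project's actual classes; only the metadata keys come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the real `Document` class: text plus metadata."""

    text: str
    metadata: dict = field(default_factory=dict)


def split_by_type(docs):
    """Separate loader output into table documents and plain-text documents."""
    tables = [d for d in docs if d.metadata.get("type") == "table"]
    texts = [d for d in docs if d.metadata.get("type") == "text"]
    return tables, texts


# Example loader output (made-up values, for illustration only).
docs = [
    Document(
        "Quarterly revenue grew.",
        {"source": "report.pdf", "type": "text", "page_label": "1"},
    ),
    Document(
        "Q1 10 Q2 12",
        {
            "source": "report.pdf",
            "type": "table",
            "table_origin": "| Q1 | Q2 |\n|----|----|\n| 10 | 12 |",
            "page_label": "2",
        },
    ),
]

tables, texts = split_by_type(docs)
```

For table documents, `table_origin` holds the markdown form of the table, which is the field to pass to an LLM or to an external viewer.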

### Document Store

- InMemoryDocumentStore
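
To illustrate the idea behind an in-memory document store, here is a rough sketch that keeps documents in a dictionary keyed by ID. The class name `SimpleInMemoryDocStore` and its `add`, `get`, and `delete` methods are assumptions made for this example; they are not the interface of `InMemoryDocumentStore`, which this page does not describe.

```python
class SimpleInMemoryDocStore:
    """Hypothetical dictionary-backed document store (illustration only)."""

    def __init__(self):
        self._docs = {}  # maps doc_id -> {"text": ..., "metadata": ...}

    def add(self, doc_id, text, metadata=None):
        """Store a document; refuse to overwrite an existing ID."""
        if doc_id in self._docs:
            raise ValueError(f"document {doc_id!r} already exists")
        self._docs[doc_id] = {"text": text, "metadata": dict(metadata or {})}

    def get(self, doc_ids):
        """Return the stored documents for the given IDs, skipping unknown IDs."""
        return [self._docs[i] for i in doc_ids if i in self._docs]

    def delete(self, doc_id):
        """Remove a document if present; unknown IDs are ignored."""
        self._docs.pop(doc_id, None)

    def __len__(self):
        return len(self._docs)


store = SimpleInMemoryDocStore()
store.add("doc-1", "Quarterly revenue grew.", {"type": "text"})
store.add("doc-2", "Q1 10 Q2 12", {"type": "table"})
found = store.get(["doc-2", "missing"])
```

Everything lives in process memory, so the contents are lost when the process exits; that trade-off is what makes this kind of store convenient for tests and small experiments.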

### Vector Store

- ChromaVectorStore
- InMemoryVectorStore
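
A vector store pairs each document ID with an embedding and returns the IDs closest to a query vector. The sketch below shows that idea with brute-force cosine similarity in plain Python. `TinyVectorStore` and its methods are hypothetical names for illustration; they are not the API of `ChromaVectorStore` or `InMemoryVectorStore`.

```python
import math


class TinyVectorStore:
    """Hypothetical brute-force vector store (illustration only)."""

    def __init__(self):
        self._items = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        """Store an embedding under a document ID."""
        self._items.append((doc_id, list(vector)))

    def query(self, vector, top_k=1):
        """Return the IDs of the top_k vectors most similar to the query."""
        scored = [(self._cosine(vector, v), doc_id) for doc_id, v in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc_id for _, doc_id in scored[:top_k]]

    @staticmethod
    def _cosine(a, b):
        """Cosine similarity; defined as 0.0 when either vector has zero length."""
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


vectors = TinyVectorStore()
vectors.add("doc-a", [1.0, 0.0])
vectors.add("doc-b", [0.0, 1.0])
nearest = vectors.query([0.9, 0.1], top_k=1)
```

The returned IDs would then be looked up in the document store to fetch the matching text. A linear scan like this is only reasonable for small collections.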