Update various docs (#4)

* rename cli tool

* remove redundant docs

* update docs

* update macos instructions

* add badges
ian_Cin
2024-03-29 19:47:03 +07:00
committed by GitHub
parent 556c48b259
commit a3bf728400
23 changed files with 339 additions and 415 deletions

@@ -0,0 +1,32 @@
The data & data structure components include:
- The `Document` class.
- The document store.
- The vector store.
### Data Loader
- PdfLoader
- Layout-aware with table parsing PdfLoader
- MathPixLoader: This loader requires a MathPix API key. Refer to the [mathpix docs](https://docs.mathpix.com/#introduction) for more information.
- OCRLoader: This loader uses lib-table and a Flax pipeline to perform OCR and read the table structure from a PDF file (TODO: add more info about deployment of this module).
- Output:
- Document: text + metadata that identifies whether the content is a table or plain text
```
- "source": source file name
- "type": "table" or "text"
- "table_origin": original table in markdown format (to be fed to an LLM or visualized with external tools)
- "page_label": page number in the original PDF document
```
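As a rough illustration of the loader output described above, the sketch below models a document as text plus a metadata dictionary using the four keys listed. The `Document` dataclass here is a hypothetical stand-in, not the library's actual class, and the file name, page labels, and table contents are made-up sample values.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for the `Document` class: text plus metadata."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def is_table(doc: Document) -> bool:
    """Return True when the metadata marks this document as a table."""
    return doc.metadata.get("type") == "table"


# Sample values below are invented for illustration only.
docs: List[Document] = [
    Document(
        text="Revenue grew in the second quarter.",
        metadata={"source": "report.pdf", "type": "text", "page_label": "3"},
    ),
    Document(
        text="Quarter Revenue Q1 10 Q2 12",
        metadata={
            "source": "report.pdf",
            "type": "table",
            # Original table kept as markdown so it can be passed to an LLM.
            "table_origin": "| Quarter | Revenue |\n|---|---|\n| Q1 | 10 |\n| Q2 | 12 |",
            "page_label": "4",
        },
    ),
]

# Keep only the table documents, e.g. to render or prompt with their markdown.
tables = [d for d in docs if is_table(d)]
```

Filtering on `"type"` lets downstream code choose between the plain text and the markdown in `"table_origin"` when building a prompt.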
### Document Store
- InMemoryDocumentStore
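To give a sense of what an in-memory document store involves, here is a minimal sketch that keeps documents in a dictionary keyed by ID. `TinyDocumentStore` and its method names are assumptions made for illustration; they do not describe the actual interface of `InMemoryDocumentStore`.

```python
from typing import Dict, List, Optional


class TinyDocumentStore:
    """Hypothetical dict-backed store; not the library's InMemoryDocumentStore."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def add(self, doc_id: str, text: str, metadata: Optional[dict] = None) -> None:
        # Adding an existing ID overwrites the previous entry.
        self._docs[doc_id] = {"text": text, "metadata": metadata or {}}

    def get(self, doc_ids: List[str]) -> List[dict]:
        # Unknown IDs are skipped rather than raising an error.
        return [self._docs[i] for i in doc_ids if i in self._docs]

    def count(self) -> int:
        return len(self._docs)


doc_store = TinyDocumentStore()
doc_store.add("doc-1", "Some page text.", {"source": "report.pdf", "type": "text"})
found = doc_store.get(["doc-1", "missing-id"])
```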
### Vector Store
- ChromaVectorStore
- InMemoryVectorStore
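The idea behind an in-memory vector store can be sketched as a mapping from document IDs to embeddings, queried by cosine similarity. The class and method names below are hypothetical and written in plain Python; they are not the interface of `InMemoryVectorStore` or `ChromaVectorStore`, and the two-dimensional embeddings are toy values.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical brute-force vector store kept entirely in memory."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, doc_id: str, embedding: List[float]) -> None:
        self._vectors[doc_id] = list(embedding)

    def query(self, embedding: List[float], top_k: int = 1) -> List[Tuple[str, float]]:
        # Score every stored vector, then return the top_k most similar.
        scored = [(doc_id, cosine(embedding, vec)) for doc_id, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


vector_store = TinyVectorStore()
vector_store.add("a", [1.0, 0.0])
vector_store.add("b", [0.0, 1.0])
hits = vector_store.query([0.9, 0.1], top_k=1)
```

The IDs returned by a query like this can then be used to look up the full text and metadata in the document store.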