kotaemon

Author	SHA1	Message	Date
Tuan Anh Nguyen Dang (Tadashi_Cin)	4704e2c11a	Add new OCRReader with PDF+OCR text merging (#66 ) This change speeds up OCR extraction by allowing bypassing OCR for texts that are irrelevant (not in table). --------- Co-authored-by: Nguyen Trung Duc (john) <trungduc1992@gmail.com>	2023-11-13 17:43:02 +07:00
Nguyen Trung Duc (john)	d79b3744cb	Simplify the `BaseComponent` interface (#64 ) This change removes `BaseComponent`'s: - run_raw - run_batch_raw - run_document - run_batch_document - is_document - is_batch Each component is expected to support multiple types of inputs and a single type of output. Since we want the component to work out-of-the-box with both standardized and customized use cases, supporting multiple types of inputs is expected. At the same time, to reduce the complexity of understanding how to use a component, we restrict a component to only have a single output type. To accommodate these changes, we also refactor some components to remove their run_raw, run_batch_raw... methods, and to decide the common output interface for those components. Tests are updated accordingly. Commit changes: * Add kwargs to vector store's query * Simplify the BaseComponent * Update tests * Remove support for Python 3.8 and 3.9 * Bump version 0.3.0 * Fix github PR caching still using the old environment after bumping version --------- Co-authored-by: ian <ian@cinnamon.is>	2023-11-13 15:10:18 +07:00
ian_Cin	6095526dc7	Add Huggingface embeddings and Cohere embeddings (#63 ) * Add huggingface embeddings and cohere embeddings * Update openai interface and the mock for newer OpenAI SDK --------- Co-authored-by: trducng <trungduc1992@gmail.com>	2023-11-10 09:38:30 +07:00
Nguyen Trung Duc (john)	9035e25666	Upgrade the declarative pipeline for cleaner interface (#51 )	2023-10-24 11:12:22 +07:00
Nguyen Trung Duc (john)	aab982ddc4	Provide ready binary for Mac and Linux to do sharing tunneling (#49 )	2023-10-17 17:19:29 +07:00
ian_Cin	2b779926c6	Directly caching the python instead of creating virtual env; add option to ignore caching (#45 ) - Directly caching the python instead of creating virtual env - add option to ignore caching using `[ignore catch]` in the commit message	2023-10-16 15:27:14 +07:00
Nguyen Trung Duc (john)	da6b35f520	Allow persisting the expected output in the code (#46 ) When the UI outputs are specified in the code, any time the user runs `kh export ...`, those outputs will be included in the UI YAML file. Otherwise, any time the user runs `kh export ...`, the output section in the UI YAML file will be reset to the default output.	2023-10-13 10:26:48 +07:00
Nguyen Trung Duc (john)	6e7905cbc0	[AUR-411] Adopt to Example2 project (#28 ) Add the chatbot from Example2. Create the UI for chat.	2023-10-12 15:13:25 +07:00
ian_Cin	533fffa6db	Enable caching for github actions (#43 )	2023-10-12 13:52:19 +07:00
ian_Cin	84f1fa8cbd	[AUR-395] Adopt Example1 disclaimer pipeline (#42 ) * Adopt Example1 disclaimer pipeline * Update Document class * Add composite components * Modify Extractor behaviours	2023-10-10 15:42:48 +07:00
Tuan Anh Nguyen Dang (Tadashi_Cin)	79cc60e6a2	[AUR-429] Add MVP pipeline with Ingestion and QA stage (#39 ) * add base Tool * minor update test_tool * update test dependency * update test dependency * Fix namespace conflict * update test * add base Agent Interface, add ReWoo Agent * minor update * update test * fix typo * remove unneeded print * update rewoo agent * add LLMTool * update BaseAgent type * add ReAct agent * add ReAct agent * minor update * minor update * minor update * minor update * update base reader with BaseComponent * add splitter * update agent and tool * update vectorstores * update load/save for indexing and retrieving pipeline * update test_agent for more use-cases * add missing dependency for test * update test case for in memory vectorstore * add TextSplitter to BaseComponent * update type hint basetool * add insurance mvp pipeline * update requirements * Remove redundant plugins param * Mock GoogleSearch --------- Co-authored-by: trducng <trungduc1992@gmail.com>	2023-10-05 12:31:33 +07:00
ian_Cin	2638152054	[Feat] Add support for f-string syntax in PromptTemplate (#38 ) * Add support for f-string syntax in PromptTemplate	2023-10-04 16:40:09 +07:00
Tuan Anh Nguyen Dang (Tadashi_Cin)	56bc41b673	Update Base interface of Index/Retrieval pipeline (#36 ) * add base Tool * minor update test_tool * update test dependency * update test dependency * Fix namespace conflict * update test * add base Agent Interface, add ReWoo Agent * minor update * update test * fix typo * remove unneeded print * update rewoo agent * add LLMTool * update BaseAgent type * add ReAct agent * add ReAct agent * minor update * minor update * minor update * minor update * update base reader with BaseComponent * add splitter * update agent and tool * update vectorstores * update load/save for indexing and retrieving pipeline * update test_agent for more use-cases * add missing dependency for test * update test case for in memory vectorstore * add TextSplitter to BaseComponent * update type hint basetool --------- Co-authored-by: trducng <trungduc1992@gmail.com>	2023-10-04 14:27:44 +07:00
Nguyen Trung Duc (john)	49ed3f6994	[AUR-405] Auto-generate markdown documentation from pipeline (#33 ) * Create a script to auto-generate markdown docs from pipeline * Clean up documentation for Chain-of-Thought	2023-10-04 10:54:24 +07:00
Nguyen Trung Duc (john)	6ab1854532	feat: Add chain-of-thought (#37 ) * Add chain-of-thought * Use BasePromptComponent * Add terminate callback for the chain-of-thought	2023-10-04 02:16:33 +07:00
Nguyen Trung Duc (john)	f80a4ea883	[AUR-425] Fix the cookiecutter command (#35 )	2023-10-03 12:13:10 +07:00
cin-jacky	205955b8a3	[AUR-387, AUR-425] Add start-project to CLI (#29 )	2023-10-03 11:55:34 +07:00
ian_Cin	d83c22aa4e	[AUR-395, AUR-415] Adopt Example1 Injury pipeline; add .flow() for enabling bottom-up pipeline execution (#32 ) * add example1/injury pipeline example * add dotenv * update various api	2023-10-02 16:24:56 +07:00
Tuan Anh Nguyen Dang (Tadashi_Cin)	3cceec63ef	[AUR-431] Add ReAct Agent (#34 ) * add base Tool * minor update test_tool * update test dependency * update test dependency * Fix namespace conflict * update test * add base Agent Interface, add ReWoo Agent * minor update * update test * fix typo * remove unneeded print * update rewoo agent * add LLMTool * update BaseAgent type * add ReAct agent * add ReAct agent * minor update * minor update * minor update * minor update * update docstring * fix max_iteration --------- Co-authored-by: trducng <trungduc1992@gmail.com>	2023-10-02 11:29:12 +07:00
Tuan Anh Nguyen Dang (Tadashi_Cin)	91048770fa	[AUR-431, AUR-435] Add Agent Interface and ReWOO Agent implementation (#31 ) * add base Tool * minor update test_tool * update test dependency * update test dependency * Fix namespace conflict * update test * add base Agent Interface, add ReWoo Agent * minor update * update test * fix typo * remove unneeded print * update rewoo agent --------- Co-authored-by: trducng <trungduc1992@gmail.com>	2023-10-01 11:53:08 +07:00
Tuan Anh Nguyen Dang (Tadashi_Cin)	f9fc02a32a	[AUR-363, AUR-433, AUR-434] Add Base Tool interface with Wikipedia/Google tools (#30 ) * add base Tool * minor update test_tool * update test dependency * update test dependency * Fix namespace conflict * update test --------- Co-authored-by: trducng <trungduc1992@gmail.com>	2023-09-29 10:18:49 +07:00
cin-jacky	317323c0e5	[AUR-424] Setup CLI interface (#25 ) * [AUR-424] Setup CLI interface * [AUR-424] fix test_vectorstore:test_query * [AUR-424] exclude examples when setup CLI * [AUR-424] create kh and kh --export * [AUR-426] revise cli by using click.group * Fix dynamic import * [AUR-426] revert the format of import packages * [AUR-426] set argument default * [AUR-426] set click dependencies in setup.py --------- Co-authored-by: trducng <trungduc1992@gmail.com>	2023-09-27 16:44:38 +09:00
Tuan Anh Nguyen Dang (Tadashi_Cin)	6c3d614973	[AUR-432] Add layout-aware table parsing PDF reader (#27 ) * add OCRReader, MathPixReader and ExcelReader * update test case for ocr reader * reformat * minor fix	2023-09-26 15:52:44 +07:00
Tuan Anh Nguyen Dang (Tadashi_Cin)	6207f4332a	[AUR-430] Add test case for Chroma VectorStore save/load (#26 ) * add test case for Chroma save/load * minor name change * add delete_collection support for chroma * move save load to chroma --------- Co-authored-by: Nguyen Trung Duc (john) <john@cinnamon.is>	2023-09-26 10:58:41 +07:00
Nguyen Trung Duc (john)	4f189dc931	[AUR-408] Export logs to Excel (#23 ) This CL implements: - The logic to export log to Excel. - Route the export logic in the UI. - Demonstrate this functionality in `./examples/promptui` project.	2023-09-25 17:20:03 +07:00
ian_Cin	08b6e5d3fb	[AUR-390] Add prompt template and prompt component (#24 ) * Export pipeline to config * Export the input to config * Preliminary creating UI dynamically * Add test for config export * Try out prompt UI * Add example projects * Fix test errors * Standardize interface for retrieval * Finalize the UI demo * Update README.md * Update README * Refactor according to main * Fix typing issue * Add openai key to git-secret * Add prompt template and prompt component * Update test * update tests * revert docstring --------- Co-authored-by: trducng <trungduc1992@gmail.com> Co-authored-by: Nguyen Trung Duc (john) <john@cinnamon.is>	2023-09-25 14:38:22 +07:00
Nguyen Trung Duc (john)	c6dd01e820	[AUR-338, AUR-406, AUR-407] Export pipeline to config for PromptUI. Construct PromptUI dynamically based on config. (#16 ) From pipeline > config > UI. Provide example project for promptui - Pipeline to config: `kotaemon.contribs.promptui.config.export_pipeline_to_config`. The config follows the schema specified in this document: https://cinnamon-ai.atlassian.net/wiki/spaces/ATM/pages/2748711193/Technical+Detail. Note: this implementation excludes the logs, which will be handled in AUR-408. - Config to UI: `kotaemon.contribs.promptui.build_from_yaml` - Example project is located at `examples/promptui/`	2023-09-21 14:27:23 +07:00
cin-jacky	c329c4c03f	[AUR-362] Add In-memory vector store (#22 ) * [AUR-362] Add In-memory vector store * [AUR-362] fix delete fun input format * [AUR-362] revise persist and from persist path to save and load * [AUR-362] revise simple.py to in_memory.py	2023-09-20 17:51:50 +09:00
ian_Cin	b794051653	[AUR-421] base output post-processor that works using regex. (#20 )	2023-09-19 19:54:44 +07:00
Nguyen Trung Duc (john)	2a3a23ecd7	[AUR-420] Provide document store base interface and an in-memory version (#21 ) Document store handles storing and indexing Documents. It supports the following interfaces: - add: add 1 or more documents into document store - get: get a list of documents - get_all: get all documents in a document store - delete: delete 1 or more documents - save: persist a document store to disk - load: load a document store from disk	2023-09-19 14:49:23 +07:00
Nguyen Trung Duc (john)	620b2b03ca	[AUR-392, AUR-413, AUR-414] Define base vector store, and make use of ChromaVectorStore from llama_index. Indexing and retrieving vectors with vector store (#18 ) Design the base interface of vector store, and apply it to the Chroma Vector Store (wrapped around llama_index's implementation). Provide the pipelines to populate and retrieve from vector store.	2023-09-14 14:18:20 +07:00
Nguyen Trung Duc (john)	c339912312	[AUR-389] Add base interface and embedding model (#17 ) This change provides the base interface of an embedding, and wraps Langchain's OpenAI embedding. Usage is as follows: ```python from kotaemon.embeddings import AzureOpenAIEmbeddings model = AzureOpenAIEmbeddings( model="text-embedding-ada-002", deployment="embedding-deployment", openai_api_base="https://test.openai.azure.com/", openai_api_key="some-key", ) output = model("Hello world") ```	2023-09-14 14:08:58 +07:00
ian_Cin	1061192731	[AUR-418] Add member public keys to git-secret: John, Ian, Tadashi, Jacky	2023-09-06 17:19:22 +07:00
trducng	f4596aa720	Fix import	2023-09-04 10:30:53 +07:00
Tuan Anh Nguyen Dang (Tadashi_Cin)	21350153d4	[AUR-391, AUR-393] Add Document and DocumentReader base (#6 ) * Declare BaseComponent * Brainstorming base class for LLM call * Define base LLM * Add tests * Clean telemetry environment for accurate testing * Fix README * Fix typing * add base document reader * update test * update requirements * Cosmetic change * update requirements * reformat --------- Co-authored-by: trducng <trungduc1992@gmail.com>	2023-08-31 11:24:12 +07:00
Nguyen Trung Duc (john)	4211315a54	[AUR-396] Scaffold prompt engineering code base section (#5 )	2023-08-30 14:31:21 +07:00
ian_Cin	5241edbc46	[AUR-361] Setup pre-commit, pytest, GitHub actions, ssh-secret (#3 ) Co-authored-by: trducng <trungduc1992@gmail.com>	2023-08-30 07:22:01 +07:00
Nguyen Trung Duc (john)	c3c25db48c	[AUR-385, AUR-388] Declare BaseComponent and decide LLM call interface (#2 ) - Use cases related to LLM call: https://cinnamon-ai.atlassian.net/browse/AUR-388?focusedCommentId=34873 - Sample usages: `test_llms_chat_models.py` and `test_llms_completion_models.py`: ```python from kotaemon.llms.chats.openai import AzureChatOpenAI model = AzureChatOpenAI( openai_api_base="https://test.openai.azure.com/", openai_api_key="some-key", openai_api_version="2023-03-15-preview", deployment_name="gpt35turbo", temperature=0, request_timeout=60, ) output = model("hello world") ``` For the LLM-call component, I decided to wrap Langchain's LLM models and Langchain's Chat models, and set the interface as follows: - Completion LLM component: ```python class CompletionLLM: def run_raw(self, text: str) -> LLMInterface: # Run text completion: str in -> LLMInterface out def run_batch_raw(self, text: list[str]) -> list[LLMInterface]: # Run text completion in batch: list[str] in -> list[LLMInterface] out # run_document and run_batch_document just reuse run_raw and run_batch_raw, due to unclear use case ``` - Chat LLM component: ```python class ChatLLM: def run_raw(self, text: str) -> LLMInterface: # Run chat completion (no chat history): str in -> LLMInterface out def run_batch_raw(self, text: list[str]) -> list[LLMInterface]: # Run chat completion in batch mode (no chat history): list[str] in -> list[LLMInterface] out def run_document(self, text: list[BaseMessage]) -> LLMInterface: # Run chat completion (with chat history): list[langchain's BaseMessage] in -> LLMInterface out def run_batch_document(self, text: list[list[BaseMessage]]) -> list[LLMInterface]: # Run chat completion in batch mode (with chat history): list[list[langchain's BaseMessage]] in -> list[LLMInterface] out ``` - The LLMInterface is as follows: ```python @dataclass class LLMInterface: text: list[str] completion_tokens: int = -1 total_tokens: int = -1 prompt_tokens: int = -1 logits: list[list[float]] = field(default_factory=list) ```	2023-08-29 15:47:12 +07:00
Nguyen Trung Duc (john)	e9d1d5c118	[AUR-401] Disable Haystack telemetry with monkey patching (#1 ) Sample Haystack log when running a pipeline. Note: the `pipeline.classname` can leak company information. ```json { "hardware.cpus": 16, "hardware.gpus": 0, "libraries.colab": false, "libraries.cuda": false, "libraries.haystack": "1.20.0rc0", "libraries.ipython": false, "libraries.pytest": false, "libraries.ray": false, "libraries.torch": false, "libraries.transformers": "4.31.0", "os.containerized": false, "os.family": "Linux", "os.machine": "x86_64", "os.version": "6.2.0-26-generic", "pipeline.classname": "TempPipeline", "pipeline.config_hash": "07a8eddd5a6e512c0d898c6d9f445ed9", "pipeline.nodes.PromptNode": 1, "pipeline.nodes.Shaper": 1, "pipeline.nodes.WebRetriever": 1, "pipeline.run_parameters.debug": false, "pipeline.run_parameters.documents": [ 0 ], "pipeline.run_parameters.file_paths": 0, "pipeline.run_parameters.labels": 0, "pipeline.run_parameters.meta": 1, "pipeline.run_parameters.params": false, "pipeline.run_parameters.queries": true, "pipeline.runs": 1, "pipeline.type": "Query", "python.version": "3.10.12" } ``` Solution: Haystack telemetry uses the `telemetry` variable, `posthog` library and `HAYSTACK_TELEMETRY_ENABLED` envar. We set the envar to False and make sure the relevant objects are disabled.	2023-08-22 10:02:46 +07:00
trducng	043209fda7	Initiate repository	2023-08-16 14:56:48 +07:00

40 Commits