* add activate directory to .gitignore
* add my custom env to gitignore, will have to change that
* add unstructured to kotaemon pyproject.toml
* add .env to gitignore
* remove .env from tracking
* make changes to the run_macos script, update readme with more detailed instructions
* remove my personal changes from gitignore
* remove line from run_macos script
* remove option for not installing miniconda for non-technical users, mark docker dependency as optional
* docs: update demo URL
* gitignore changes
* merge .env.example
* revert changes to run_macos.sh
* move unstructured to advanced dependencies
* add link to unstructured system dependencies
* remove api key
* fix: skip tests when unstructured pdf not installed
* chore: loosen unstructured package version in pyproject.toml
* chore: correct syntax
---------
Co-authored-by: Tadashi <tadashi@cinnamon.is>
Co-authored-by: cin-albert <albert@cinnamon.is>
* Rename AzureChatOpenAI to LCAzureChatOpenAI
* Provide vanilla ChatOpenAI and AzureChatOpenAI
* Remove the highest accuracy, lowest cost criteria
These criteria are unnecessary. The users, not the pipeline creators, should choose
which LLM to use. Furthermore, it's cumbersome to input this information, and it
really degrades the user experience.
* Remove the LLM selection in simple reasoning pipeline
* Provide a dedicated stream method to generate the output
* Return placeholder message to chat if the text is empty
* Auto create conversation when the user starts
* Add conversation rename rule check
* Fix empty name during save
* Confirm deleting conversation
* Show a warning if users don't select a file when uploading files in the File Index
* Show feedback when the user uploads a duplicated file
* Limit the file types
* Fix username validation
* Allow login when the username has leading or trailing whitespace
* Improve the user experience
* Disable admin panel for non-admin users
* Refresh user lists after creating/deleting users
* Auto login
* Clear admin information upon signing out
* Fix being unable to receive uploaded filenames that include special characters, like !@#$%^&*().pdf
* Set upload validation for FileIndex
* Improve user management UI/UX
* Show extraction error when indexing file
* Return selected user as -1 when signing out
* Fix default supported file types in file index
* Validate changing password
* Allow the selector to contain multiple gradio components
* A more tolerable placeholder screen
* Allow chat suggestion box
* Increase concurrency limit
* Make adobe loader optional
* Use BaseReasoning
---------
Co-authored-by: trducng <trungduc1992@gmail.com>
* serve local model in a different process from the app
---------
Co-authored-by: albert <albert@cinnamon.is>
Co-authored-by: trducng <trungduc1992@gmail.com>
* feat: Add installers for linux, windows, and macos
* docs: Update README
* pre-commit fix styles
* Update installers and README
* Remove env vars check and fix paths
* Update installers:
* Remove start.py and move install and launch part back to .sh/.bat
* Add conda deactivate
* Make messages more informative
* Improve kotaemon based on insights from projects (#147)
- Include static files in the package.
- More reliable information panel. Faster & not breaking randomly.
- Add directory upload.
- Enable zip file to upload.
- Allow setting endpoint for the OCR reader using environment variable.
* Make macOS installer runnable and improve Windows, Linux installers
* Minor fix macos commands
* installation should pause before exit
* Update Windows installer: add a new label to exit function with error
* put install_dir to .gitignore
* chore: Add comments to clarify the 'end' labels
---------
Co-authored-by: Duc Nguyen (john) <trungduc1992@gmail.com>
Co-authored-by: ian <ian@cinnamon.is>
* add test case for Chroma save/load
* minor name change
* add delete_collection support for chroma
* move save load to chroma
---------
Co-authored-by: Nguyen Trung Duc (john) <john@cinnamon.is>
- Use cases related to LLM call: https://cinnamon-ai.atlassian.net/browse/AUR-388?focusedCommentId=34873
- Sample usages: `test_llms_chat_models.py` and `test_llms_completion_models.py`:
```python
from kotaemon.llms.chats.openai import AzureChatOpenAI
model = AzureChatOpenAI(
    openai_api_base="https://test.openai.azure.com/",
    openai_api_key="some-key",
    openai_api_version="2023-03-15-preview",
    deployment_name="gpt35turbo",
    temperature=0,
    request_timeout=60,
)
output = model("hello world")
```
For the LLM-call component, I decided to wrap around Langchain's LLM models and Langchain's Chat models, and set the interface as follows:
- Completion LLM component:
```python
class CompletionLLM:
    def run_raw(self, text: str) -> LLMInterface:
        # Run text completion: str in -> LLMInterface out
        ...

    def run_batch_raw(self, text: list[str]) -> list[LLMInterface]:
        # Run text completion in batch: list[str] in -> list[LLMInterface] out
        ...

    # run_document and run_batch_document just reuse run_raw and
    # run_batch_raw, due to unclear use case
```
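To make the contract concrete, here is a toy sketch of a conforming component (`EchoLLM` is invented purely for illustration; it echoes the input uppercased through the `LLMInterface` dataclass shown further below):

```python
# Toy stand-in for a real completion component, only to make the sketch
# self-contained; a real component would call an actual LLM provider.
class EchoLLM(CompletionLLM):
    def run_raw(self, text: str) -> LLMInterface:
        return LLMInterface(text=[text.upper()])

    def run_batch_raw(self, text: list[str]) -> list[LLMInterface]:
        return [self.run_raw(t) for t in text]

llm = EchoLLM()
print(llm.run_raw("hello").text[0])                        # "HELLO"
print([o.text[0] for o in llm.run_batch_raw(["a", "b"])])  # ["A", "B"]
```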
- Chat LLM component:
```python
class ChatLLM:
    def run_raw(self, text: str) -> LLMInterface:
        # Run chat completion (no chat history): str in -> LLMInterface out
        ...

    def run_batch_raw(self, text: list[str]) -> list[LLMInterface]:
        # Run chat completion in batch mode (no chat history):
        # list[str] in -> list[LLMInterface] out
        ...

    def run_document(self, text: list[BaseMessage]) -> LLMInterface:
        # Run chat completion (with chat history):
        # list[langchain's BaseMessage] in -> LLMInterface out
        ...

    def run_batch_document(self, text: list[list[BaseMessage]]) -> list[LLMInterface]:
        # Run chat completion in batch mode (with chat history):
        # list[list[langchain's BaseMessage]] in -> list[LLMInterface] out
        ...
```
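A usage sketch for the history-aware path, reusing the `model` instance from the AzureChatOpenAI example above (the message classes are Langchain's; the credentials remain placeholders):

```python
from langchain.schema import AIMessage, HumanMessage, SystemMessage

# `model` is the AzureChatOpenAI instance constructed in the first example.
history = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="What is 2 + 2?"),
    AIMessage(content="4"),
    HumanMessage(content="Now double it."),
]
output = model.run_document(history)  # chat completion with chat history
print(output.text[0], output.total_tokens)
```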
- The LLMInterface is as follows:
```python
from dataclasses import dataclass, field


@dataclass
class LLMInterface:
    text: list[str]
    completion_tokens: int = -1
    total_tokens: int = -1
    prompt_tokens: int = -1
    logits: list[list[float]] = field(default_factory=list)
```
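A quick sketch of constructing and consuming the dataclass (the values here are made up for illustration):

```python
result = LLMInterface(
    text=["Hello!"],
    completion_tokens=2,
    total_tokens=12,
    prompt_tokens=10,
)
answer = result.text[0]       # primary generated text
usage = result.total_tokens   # token accounting; -1 means unknown
```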