|
yeah, Mixtral excels at a lot of traditional NLP tasks like NER and sentiment analysis on unstructured data. among open-source models, Llama-70B wasn't good enough, but Mixtral is the first LLM that might be "good enough" for a lot of zero-shot/few-shot tasks where you previously had to finetune a BERT-type transformer model. the big challenge right now for businesses trying to leverage LLMs is building a pipeline that gets their data (.ppt, .xlsx, .doc, .pdf, SharePoint connectors, etc.) into RAG/LLM solutions, and gets the answers back out of the model(s).
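not anyone's production stack, just a minimal sketch of the "data in" half of that pipeline, assuming pypdf, python-docx, and openpyxl for the modern Office formats (legacy .doc/.ppt and SharePoint connectors need separate tooling; paths and the dispatch table are illustrative):

```python
# Pull plain text out of common office formats so it can be chunked and
# embedded downstream in a RAG pipeline.
from pathlib import Path

from pypdf import PdfReader          # pip install pypdf
from docx import Document            # pip install python-docx
from openpyxl import load_workbook   # pip install openpyxl


def extract_pdf(path: Path) -> str:
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_docx(path: Path) -> str:
    doc = Document(str(path))
    return "\n".join(p.text for p in doc.paragraphs)


def extract_xlsx(path: Path) -> str:
    wb = load_workbook(path, read_only=True, data_only=True)
    lines = []
    for ws in wb.worksheets:
        for row in ws.iter_rows(values_only=True):
            lines.append("\t".join(str(c) for c in row if c is not None))
    return "\n".join(lines)


EXTRACTORS = {".pdf": extract_pdf, ".docx": extract_docx, ".xlsx": extract_xlsx}


def load_corpus(root: Path) -> dict[str, str]:
    """Walk a folder and return {file path: extracted text} for supported types."""
    corpus = {}
    for path in root.rglob("*"):
        extractor = EXTRACTORS.get(path.suffix.lower())
        if extractor:
            corpus[str(path)] = extractor(path)
    return corpus
```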
|
# ¿ Mar 22, 2024 00:38
|
|
# ¿ May 21, 2024 04:47
|
Vector DBs (and probably LLMs too) are overkill for the first step of a hobbyist semantic search project. Look into embedding your logical units of text with small language models first (look up sentence-transformers), saving those vectors as numpy arrays, then doing a similarity search with your query. You can move on to vector DBs and big-L LMs for embedding if you need more accuracy and are really dealing with big data.
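a minimal sketch of that workflow, assuming sentence-transformers and numpy are installed; the model name, corpus, and file path are placeholders:

```python
# Semantic search with sentence-transformers + numpy, no vector DB needed.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly embedder

# 1. Embed your logical units of text once and save the matrix to disk.
chunks = ["first logical unit of text", "second logical unit of text"]  # your corpus
embeddings = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, dim)
np.save("embeddings.npy", embeddings)

# 2. At query time, load the matrix and rank chunks by cosine similarity.
embeddings = np.load("embeddings.npy")
query_vec = model.encode(["what am I looking for?"], normalize_embeddings=True)[0]
scores = embeddings @ query_vec          # dot product == cosine sim for normalized vectors
top_k = np.argsort(scores)[::-1][:5]     # indices of the 5 best-matching chunks
for i in top_k:
    print(f"{scores[i]:.3f}  {chunks[i]}")
```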
|
# ¿ Apr 18, 2024 07:51
|
it sounds like you're being limited by the way you're chunking up the audio transcripts, not necessarily by embedding performance. think about how you could further preprocess the raw transcripts so that a chunk is meaty enough for search to work:

1. youtube captioning, and video captioning more generally, sticks words together so that they appear "in time" and neatly as one or two lines of text visually. you can apply some further heuristics around timestamps to gather captions together into larger logical units - e.g., logical chunks are separated by time gaps of > k seconds (see the sketch below)
2. do some speaker ID (off-the-shelf models are google-able) to separate text chunks by speaker

anyway, if you really want to try alternative embedders:

a. bge - considered SOTA open source/weights a couple of months back: https://huggingface.co/BAAI/bge-large-en-v1.5
b. use one of the cloud embedding APIs (e.g., openai's)
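not from the post, just a toy sketch of the timestamp-gap heuristic in point 1; the (start_seconds, text) caption format and the 5-second threshold are made up for illustration:

```python
# Merge caption lines into larger chunks whenever the gap between
# consecutive captions is small; start a new chunk when the gap is big.
def merge_captions(captions, max_gap_s=5.0):
    """captions: list of (start_seconds, text); returns list of merged chunk strings."""
    chunks, current = [], []
    prev_start = None
    for start, text in captions:
        if prev_start is not None and start - prev_start > max_gap_s:
            chunks.append(" ".join(current))   # gap too big: close the current chunk
            current = []
        current.append(text)
        prev_start = start
    if current:
        chunks.append(" ".join(current))
    return chunks


captions = [(0.0, "welcome back to"), (1.2, "the channel today"), (9.0, "let's look at")]
print(merge_captions(captions))  # -> ['welcome back to the channel today', "let's look at"]
```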
|
# ¿ Apr 19, 2024 06:24