shrike82
Jun 11, 2005

yeah, mixtral excels at a lot of traditional NLP tasks like NER, sentiment analysis, etc. on unstructured data.
in terms of open-source models, llama-70b wasn't good enough, but mixtral is the first LLM that might be "good enough" for a lot of zero-shot/few-shot tasks where you previously had to finetune a BERT-type transformer model.
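to make the zero-shot/few-shot point concrete, here's a rough sketch of what replacing a finetuned BERT classifier with a prompted LLM looks like. everything here is illustrative - the example reviews, labels, and prompt wording are made up, and the resulting string would be sent to whatever LLM API you're using (mixtral via an inference server, etc.):

```python
# Hypothetical sketch: a few-shot sentiment prompt for an instruction-tuned
# LLM (e.g. mixtral) in place of a finetuned BERT-style classifier.
# Examples and labels below are invented for illustration.

FEW_SHOT_EXAMPLES = [
    ("The delivery was two weeks late and support ignored me.", "negative"),
    ("Works exactly as described, would buy again.", "positive"),
]

def build_sentiment_prompt(text: str) -> str:
    """Assemble a few-shot prompt string; send it to your LLM of choice."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for review, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {review}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The model is expected to complete the final "Sentiment:" line.
    lines.append(f"Review: {text}")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_sentiment_prompt("Battery died after a month.")
```

the nice part is that swapping the task (NER, topic tagging, etc.) is just a different prompt, not a new finetuning run.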

the big challenge right now for businesses trying to leverage LLMs is building a pipeline for their data (.ppt, .xlsx, .doc, .pdf, sharepoint connectors, etc.) into RAG/LLM solutions, as well as getting data back out of the model(s)

shrike82
Jun 11, 2005

Vector DBs (and probably LLMs too) are overkill for the first step of a hobbyist semantic search project.

Look into embedding your logical units of text with small language models first (look up sentence-transformers), saving those vectors as numpy arrays, and then doing a similarity search against your query.

You can move on to vector DBs and big-L LMs for embedding if you need more accuracy and are genuinely dealing with big data.
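the whole pipeline fits in a few lines of numpy. in practice the vectors would come from something like sentence-transformers' `model.encode()` (e.g. all-MiniLM-L6-v2, which emits 384-dim vectors); random stand-in vectors are used here so the sketch runs without downloading a model:

```python
import numpy as np

# Sketch of the hobbyist pipeline: embed chunks (normally via a
# sentence-transformers model's .encode()), persist them as a numpy array,
# then score a query by cosine similarity. Toy random vectors stand in
# for real embeddings; 384 matches the MiniLM embedding dimension.
rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(100, 384)).astype(np.float32)
# Pretend the query is semantically close to chunk 42.
query_vec = chunk_vecs[42] + 0.01 * rng.normal(size=384).astype(np.float32)

# L2-normalize so a plain dot product equals cosine similarity.
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)

np.save("chunk_vecs.npy", chunk_vecs)   # save once...
chunk_vecs = np.load("chunk_vecs.npy")  # ...reload on later runs

scores = chunk_vecs @ query_vec         # cosine similarity per chunk
top5 = np.argsort(-scores)[:5]          # indices of best-matching chunks
```

brute-force dot products like this are fine up to hundreds of thousands of chunks on a laptop, which is why a vector DB is overkill at hobby scale.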

shrike82
Jun 11, 2005

it sounds like you're being limited by the way you're chunking up the audio transcripts, not necessarily embedding performance.
think about how you could further preprocess the raw transcripts so that a chunk is meaty enough for search to work

1. youtube captioning, and video captioning more generally, sticks words together so that they appear "in time" and fit neatly as one or two lines of text on screen. you can apply further heuristics on the timestamps to gather captions into larger logical units - e.g., logical chunks are separated by time gaps of > k seconds

2. do some speaker ID (off-the-shelf models are googleable) to separate text chunks by speaker
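heuristic 1 might look like the sketch below. the (start, end, text) tuple format and the 3-second threshold are assumptions for illustration - real caption files (.vtt/.srt) would need parsing first:

```python
# Sketch of heuristic 1: merge caption fragments into larger logical
# chunks, splitting wherever the gap between consecutive fragments
# exceeds `gap_seconds`. Caption format (start, end, text) is assumed.

def merge_captions(captions, gap_seconds=3.0):
    """captions: list of (start_sec, end_sec, text), sorted by start time."""
    chunks = []
    current = []
    prev_end = None
    for start, end, text in captions:
        if prev_end is not None and start - prev_end > gap_seconds:
            chunks.append(" ".join(current))  # close out the current chunk
            current = []
        current.append(text)
        prev_end = end
    if current:
        chunks.append(" ".join(current))
    return chunks

caps = [
    (0.0, 1.5, "hey everyone"),
    (1.6, 3.0, "welcome back to the channel"),
    (9.0, 10.2, "today we're looking at embeddings"),
]
chunks = merge_captions(caps)
```

tune `gap_seconds` against your actual transcripts; the right threshold depends on how chatty the video is.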

anyway, if you really want to try alternative embedders:

a. bge - considered SOTA open source/weights a couple months back: https://huggingface.co/BAAI/bge-large-en-v1.5
b. use one of the cloud embedding APIs (e.g., openai's)
