Are there any LLM/NLP gurus in here? Asking before I make a big effort post.

# Apr 17, 2024 16:18

# May 17, 2024 02:37

I'll just go ahead and post anyway. This is a cross post from the Python thread:

Cyril Sneer posted:
Fun little learning project I want to do, but I need some direction. I want to extract all the video transcripts from a particular youtube channel and make them both keyword and semantically searchable, returning the relevant video timestamps. To elaborate, here is a slightly longer example of how the transcript data is returned (this being 6 dictionary entries in a single transcript's list of dicts): code:
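The quoted code block didn't survive the cross post. As a stand-in, youtube-transcript-api returns each transcript as a list of dicts along these lines — the entries below are invented for illustration, not the original example:

```python
# Illustrative stand-in for the transcript format the post refers to:
# youtube-transcript-api returns a list of dicts per video, one per
# caption line, with the text plus its start time and duration in seconds.
transcript = [
    {"text": "welcome back to the channel", "start": 0.0, "duration": 2.4},
    {"text": "today we're looking at semantic search", "start": 2.4, "duration": 3.1},
    {"text": "over a set of video transcripts", "start": 5.5, "duration": 2.8},
]
print(transcript[1]["start"])  # 2.4
```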
(1) https://mlops.community/how-to-build-your-first-semantic-search-system-my-step-by-step-guide-with-code/

Here, the case study is performing a semantic search on a corpus of paper abstracts. Each entry in the database is a (long) string of the abstract, along with metadata (author, title, etc.). In my case, it's easy enough to build a full transcript string from the sub-strings, but then I lose the timestamp info. I suppose if the search strategy can identify a location in the full string (rather than just identifying the transcript with the best match), I could do a reverse look-up on those strings.

(2) https://medium.com/ai-science/build-semantic-search-applications-using-open-source-vector-database-chromadb-a15e9e7f14ce

In this case, the author shows how to add entries in the form of sentences (along with metadata), which is fairly close to what I have. From their example, I might do something like: code:
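The example code here is also missing from the quote. A minimal sketch of the per-sentence approach, assuming entries in the youtube-transcript-api list-of-dicts format — the function name and id scheme are my own invention:

```python
def to_chroma_args(video_id, entries):
    """Turn one video's transcript entries into the (documents, metadatas,
    ids) lists that chromadb's collection.add() expects, keeping the
    timestamp and video id as per-sentence metadata."""
    documents = [e["text"] for e in entries]
    metadatas = [{"video_id": video_id, "start": e["start"]} for e in entries]
    ids = [f"{video_id}-{i}" for i in range(len(entries))]
    return documents, metadatas, ids

# Then, with chromadb installed (not run here):
#   import chromadb
#   collection = chromadb.Client().create_collection("transcripts")
#   docs, metas, ids = to_chroma_args("abc123", entries)
#   collection.add(documents=docs, metadatas=metas, ids=ids)
```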
Next, if I'm understanding things correctly, neither of the above approaches actually relies on LLMs; they're just using similarity metrics within the embedding space? Hopefully I've explained this well enough... I'd love some direction on how to make this all work! I'm fine with packaged solutions so long as they're free/open source.

Cyril Sneer fucked around with this message at 18:02 on Apr 17, 2024
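The reverse look-up idea from (1) can be sketched by recording each caption line's character offset while concatenating the full transcript string; then a match position in the full string maps back to a timestamp with a binary search. Helper names here are hypothetical:

```python
import bisect

def build_full_text(entries):
    """Concatenate caption lines into one string, recording each line's
    character offset so a match position maps back to a timestamp."""
    offsets, starts, parts, pos = [], [], [], 0
    for e in entries:
        offsets.append(pos)
        starts.append(e["start"])
        parts.append(e["text"])
        pos += len(e["text"]) + 1  # +1 for the joining space
    return " ".join(parts), offsets, starts

def timestamp_at(char_pos, offsets, starts):
    """Reverse look-up: timestamp of the caption line containing char_pos."""
    return starts[bisect.bisect_right(offsets, char_pos) - 1]
```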
# Apr 17, 2024 18:00

mightygerm posted:
Yeah, you're looking for a vector database. You can manually add the timestamp/video id as metadata to retrieve them alongside your sentence.

What do you think of chromadb, which is used in the example I linked to? It's also still unclear to me how these vector databases address this:

Cyril Sneer posted:
What's unclear to me here though is if I'll lose a lot of contextual information since the model/engine/tool doesn't 'see' the whole video transcript (or maybe it does via the IDs?).

Or, put another way, does breaking up a transcript in this way sacrifice contextual understanding? Finding a sentence is different from understanding the broader context within a paragraph, or the topical nature of a particular video at large.

# Apr 17, 2024 20:20

shrike82 posted:
Look into embedding your logical units of text starting with small language models (look up sentence-transformers), saving those vectors as numpy arrays, then doing a similarity search with your query

I've implemented this now with chromadb, following the tutorial in link #2. It works, but it's not quite giving me what I want. As I suspected, because my text samples are just these short word strings, it more or less just acts as a word search.

I found a nice demo here, also using youtube transcripts. They address the above concern by chunking together 20 entries to make longer text samples, and they also use a rolling-window approach: https://github.com/lancedb/lancedb/blob/main/docs/src/notebooks/youtube_transcript_search.ipynb

However, I think they're still just relying on sentence-transformers and similarity searches.

shrike82 posted:
You can move onto Vector DBs and big-L LMs for embedding if you need more accuracy and are really dealing with big data.

I'd like to start exploring this, so if you've got some suggestions for next steps that'd be great.
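shrike82's recipe (embed the chunks, keep the vectors as numpy arrays, similarity-search the query) reduces to a cosine-similarity lookup. A minimal sketch, with toy 2-d vectors standing in for real sentence-transformer embeddings:

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    """Cosine similarity between one query vector and a matrix of
    document vectors; returns indices of the k best matches."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]

# Toy 2-d "embeddings" so the behaviour is easy to eyeball:
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.1])
print(top_k(query, docs, k=2))  # nearest: doc 0, then doc 2
```

With real data, `doc_matrix` would be the stacked output of a sentence-transformers model over the chunks, saved to disk as a numpy array.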

# Apr 19, 2024 05:54

shrike82 posted:
it sounds like you're being limited by the way you're chunking up the audio transcripts, not necessarily embedding performance.

Right, that was my hypothesis above: that the transcript 'units' are too small to be of much contextual value. For (2), luckily my target channel is just one person. I'm going to try the idea of chunking together 20 sentences with a rolling window of, say, 5, as was done in the project I linked above.

shrike82 posted:
anyway, if you really want to try alternative embedders:

That lancedb github link I posted actually does use an embedding model from OpenAI, so I'll queue that up to try as well (they used it along with the chunking approach I described above).
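The chunking plan (20 entries per chunk, rolling window of 5) might look like this. A sketch assuming the usual {'text', 'start'} entry format, keeping each chunk's first timestamp so a search hit still maps back to a point in the video:

```python
def chunk_entries(entries, size=20, stride=5):
    """Group transcript entries into overlapping chunks: each chunk joins
    `size` consecutive caption lines, and consecutive chunks start `stride`
    entries apart, so context carries across chunk boundaries."""
    chunks = []
    for i in range(0, max(len(entries) - size + 1, 1), stride):
        window = entries[i:i + size]
        chunks.append({
            "text": " ".join(e["text"] for e in window),
            "start": window[0]["start"],  # timestamp of the chunk's first line
        })
    return chunks
```

These longer chunk strings, rather than the raw caption lines, would then be what gets embedded and stored in the vector database.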

# Apr 19, 2024 16:53