Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
Are there any LLM/NLP gurus in here? Asking before I make a big effort post.

Cyril Sneer
Aug 8, 2004

I'll just go ahead and post anyway. This is a cross post from the Python thread:

Cyril Sneer posted:

Fun little learning project I want to do but need some direction. I want to extract all the video transcripts from a particular youtube channel and make them both keyword and semantically searchable, returning the relevant video timestamps.

I've got the scraping/extraction part working. Each video transcript is returned as a list of dictionaries, where each dictionary contains the timestamp and roughly a sentence's worth of text:

code:
    {
        'text': 'replace the whole thing anyways right so',
        'start': 1331.08,
        'duration': 4.28
    }

I don't really know how YT breaks up the text, but I don't think it really matters. Anyway, I obviously don't want to re-extract the transcripts every time, so I need to store everything in some kind of database -- and in a manner amenable to reasonably speedy keyword searching. If we call this checkpoint 1, I don't have a good sense of what this solution would look like.

Next, I want to make the corpus of text (is that the right term?) semantically searchable. This part is even foggier. Do I train my own LLM from scratch? Do some kind of transfer learning thing (i.e., take an existing model and provide my text as additional training data)? Can I just point chatGPT at it (lol)?

I want to eventually wrap it in a web UI, but I can handle that part. Thanks goons! This will be a neat project.
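
Stepping out of the quote for a sec: on the checkpoint-1 storage question, what I'm currently picturing is just SQLite with its full-text index, something like this (untested sketch, table/column names made up, and it assumes the SQLite build has FTS5 compiled in):

code:
    import sqlite3

    con = sqlite3.connect("transcripts.db")
    # FTS5 virtual table: 'text' is keyword-searchable, start/duration are just stored
    con.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS entries USING fts5(
            video_id, text, start UNINDEXED, duration UNINDEXED
        )
    """)

    def store_transcript(video_id, transcript):
        # transcript = the list of dicts the scraper returns for one video
        con.executemany(
            "INSERT INTO entries (video_id, text, start, duration) VALUES (?, ?, ?, ?)",
            [(video_id, d["text"], d["start"], d["duration"]) for d in transcript],
        )
        con.commit()

    def keyword_search(query):
        # MATCH does the keyword search; 'start' is the timestamp to jump to
        return con.execute(
            "SELECT video_id, start, text FROM entries WHERE entries MATCH ? ORDER BY rank",
            (query,),
        ).fetchall()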

To elaborate, here is a slightly longer example of how the transcript data is returned (this being six dictionary entries from a single transcript's list of dicts):

code:
    [
    {'text': 'for all of you engineer types that are', 'start': 1503.84, 'duration': 2.4},
    {'text': 'watching here', 'start': 1505.36, 'duration': 2.799},
    {'text': "you know i'm going to tell you of course", 'start': 1506.24, 'duration': 4.319},
    {'text': 'the lead itself will act as an inductor', 'start': 1508.159, 'duration': 3.76},
    {'text': 'at a certain frequency', 'start': 1510.559, 'duration': 3.041},
    {'text': 'but for most of the frequencies that', 'start': 1511.919, 'duration': 4.0}
    ]
As you can see, YT breaks up the text every few words. I'm working with ~260 videos, each about 40 mins long. Okay, with that in mind, I've been looking at two tutorials that set things up in different ways, namely:


(1) https://mlops.community/how-to-build-your-first-semantic-search-system-my-step-by-step-guide-with-code/

Here, the case study is performing a semantic search on a corpus of paper abstracts. Each entry in the database is a (long) string of the abstract, along with metadata (author, title, etc.). In my case, it's easy enough to build a full transcript string from the sub-strings, but then I lose the timestamp info. I suppose if the search strategy can identify a location in the full string (rather than just identifying the transcript with the best match), I could do a reverse look-up on those strings.
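
Sketching what I mean by that reverse look-up (untested, and using plain str.find just to stand in for whatever position the search actually returns):

code:
    def build_full_text(transcript):
        # Concatenate the sub-strings and remember where each one starts in the
        # full string, so a character position maps back to a timestamp.
        pieces, offsets, pos = [], [], 0
        for d in transcript:
            pieces.append(d["text"])
            offsets.append((pos, d["start"]))   # (char offset, start time)
            pos += len(d["text"]) + 1           # +1 for the joining space
        return " ".join(pieces), offsets

    def timestamp_at(offsets, char_pos):
        # Start time of the entry containing char_pos
        best = offsets[0][1]
        for off, start in offsets:
            if off <= char_pos:
                best = start
            else:
                break
        return best

    full_text, offsets = build_full_text(transcript)   # one video's list of dicts
    hit = full_text.find("act as an inductor")
    if hit != -1:
        print(timestamp_at(offsets, hit))              # should land around 1508.159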



(2) https://medium.com/ai-science/build-semantic-search-applications-using-open-source-vector-database-chromadb-a15e9e7f14ce

In this case, the author shows how to add entries in the form of sentences (along with metadata), which is fairly close to what I have. From their example, I might do something like:

code:
    collection.add(
        documents=["for all of you engineer types that are", "watching here", "you know i'm going to tell you of course"],
        metadatas=[{"start": 1503.84}, {"start": 1506.24}, {"start": 1508.159}],
        ids=["aabbcc", "aabbcc","aabbcc"] #All from same video/transcript
    )  
What's unclear to me here though is if I'll lose a lot of contextual information, since the model/engine/tool doesn't 'see' the whole video transcript (or maybe it does via the IDs?). Or put another way, does breaking up a transcript in this way sacrifice contextual understanding?
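
Either way, I assume the retrieval side would look roughly like this, with the start times coming back via the metadata (untested):

code:
    results = collection.query(
        query_texts=["does the lead act as an inductor"],
        n_results=3,
    )
    # each hit comes back with its metadata, i.e. the start timestamp to jump to
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        print(meta["start"], doc)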


Next, if I'm understanding things correctly, neither of the above approaches is actually relying on LLMs, but rather just on similarity metrics within the embedding space?
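
i.e., I'm picturing the core operation as something like this (a sketch with sentence-transformers; I haven't checked what either tutorial actually uses under the hood):

code:
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose embedder
    docs = [
        "the lead itself will act as an inductor",
        "for all of you engineer types that are",
    ]
    doc_emb = model.encode(docs)
    query_emb = model.encode("parasitic inductance of component leads")
    print(util.cos_sim(query_emb, doc_emb))           # cosine similarity, no LLM involved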


Hopefully I've explained this well enough... would love some direction on how to make this all work! I'm fine with packaged solutions so long as they're free/open source.


Cyril Sneer
Aug 8, 2004


mightygerm posted:

Yeah, you’re looking for a vector database. You can manually add the timestamp/video id as metadata to retrieve them alongside your sentence.

There’s a couple good open source implementations like weaviate and lancedb.

What do you think of chromadb, used in the example I linked to?

Also still unclear how these vector databases address this:

Cyril Sneer posted:

What's unclear to me here though is if I'll lose a lot of contextual information, since the model/engine/tool doesn't 'see' the whole video transcript (or maybe it does via the IDs?). Or put another way, does breaking up a transcript in this way sacrifice contextual understanding?

Finding a sentence is different than understanding a broader context within a paragraph, or the topical nature of a particular video at large.

Cyril Sneer
Aug 8, 2004


shrike82 posted:

Look into embedding your logical units of text starting with small language models (look up sentence-transformers), saving those vectors as numpy arrays then doing a similarity search with your query

I've implemented this now with chromadb, following that tutorial in link #2. It works, but it's not quite giving me what I want. As I suspected, because my text samples are just these short word strings, it more or less just seems to be acting as a word search.

I found a nice demo here, also using youtube transcripts. They address the above concern by chunking together 20 entries to make longer text samples, and also use a rolling window approach:

https://github.com/lancedb/lancedb/blob/main/docs/src/notebooks/youtube_transcript_search.ipynb

However, I think they're still just relying on sentence-transformers and similarity searches.


shrike82 posted:

You can move onto Vector DBs and big-L LMs for embedding if you need more accuracy and are really dealing with big data.

I'd like to start exploring this so if you've got some suggestions for next steps that'd be great :)

Cyril Sneer
Aug 8, 2004


shrike82 posted:

it sounds like you're being limited by the way you're chunking up the audio transcripts, not necessarily embedding performance.
think about how you could further preprocess the raw transcripts so that a chunk is meaty enough for search to work

1. youtube captioning, and video captioning more generally, stick together words so that they appear "in time" and neatly as one or two lines of text visually. you can apply some further heuristics around timestamps to gather captions together into larger logical units - e.g., logical chunks are separated by time gaps of > k seconds

2. do some speaker id (off-the-shelf models google-able) to separate text chunks by speakers


Right, that was my hypothesis above: that the transcript 'units' are too small to be of much contextual value. For (2), luckily my target channel is just one person.

I'm going to try the idea of chunking together 20 sentences with a rolling window of say 5, as was done in that project I linked above.
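
Something like this is what I have in mind (untested; window of 20 entries with a stride of 5, each chunk keeping the start time of its first entry so results still map to a timestamp; it assumes the chromadb collection from before and a video_id variable for that video):

code:
    def chunk_transcript(transcript, window=20, stride=5):
        # Group consecutive entries into overlapping chunks
        chunks = []
        for i in range(0, len(transcript), stride):
            entries = transcript[i:i + window]
            chunks.append({
                "text": " ".join(d["text"] for d in entries),
                "start": entries[0]["start"],
            })
            if i + window >= len(transcript):
                break
        return chunks

    chunks = chunk_transcript(transcript)   # one video's list of dicts
    collection.add(
        documents=[c["text"] for c in chunks],
        metadatas=[{"start": c["start"]} for c in chunks],
        ids=[f"{video_id}-{i}" for i in range(len(chunks))],   # unique id per chunk
    )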



shrike82 posted:

anyway, if you really want to try alternative embedders:

a. bge - considered SOTA open source/weights a couple months back: https://huggingface.co/BAAI/bge-large-en-v1.5
b. use one of the cloud embedding APIs (e.g., openai's)

That lancedb github link I posted actually does use an OpenAI embedding, so I'll queue that up to try as well (they used it along with the chunking approach I described above).
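
If I stick with chromadb rather than lancedb, I believe the swap is just a different embedding function on the collection, roughly like this (untested; needs the openai package and an API key, and the model name is just my guess at a current OpenAI embedding model):

code:
    import chromadb
    from chromadb.utils import embedding_functions

    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key="sk-...",                        # placeholder
        model_name="text-embedding-3-small",
    )
    client = chromadb.PersistentClient(path="chroma_db")
    collection = client.get_or_create_collection(
        "transcripts", embedding_function=openai_ef
    )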
