ProvostZak
Dec 31, 2000

it's a huge step
Not entirely sure if this is okay to post here, since it's a little long, but it seems to fit.

I posted a thread on this topic a while back (http://forums.somethingawful.com/showthread.php?threadid=2699687) and y'all were very helpful, though my extremely poor Perl skills meant I couldn't make much use of all the help given. But I have gotten a little bit better now...

In a nutshell, I wanted to break a csv file filled with long paragraphs of text in rows into groups of single and multiple words, and then count the occurrences of each word set. So for example, the previous two paragraphs would break down as follows:

    I
    posted
    a
    thread

    I posted
    posted a
    a thread

    I posted a
    posted a thread
    a thread on


And I initially planned for the script to count the number of times words or phrases appeared. I have since realised that I can use Access to cover the relatively hard part of the script (the counting of entries), which leaves me with the more straightforward problem of getting the damnable sentences to split at the correct point. Here's the (no doubt quite ugly) script I was able to write:

code:
$filename = "samplefile.txt";
open ( FILE, $filename) or die "Cannot open file: $!";

 while ( $data = <FILE> ) {

 $data =~ tr/A-Z/a-z/;

 @values = split(/\s/, $data);

 foreach $val (@values)

 {
       open (OUTFILE, '>>testfile.txt');
       print OUTFILE "$val\n";
       close (OUTFILE);
 }
}
exit 0;
This code does the simplest bit of my request: it breaks on whitespace, thus giving me the first set of the list I need. The challenge though, which I have yet to crack, is getting the damnable thing to break on word pairs, triples etc. I have tried a lot of different things that seem like they might work, such as /w+\s\w+/ but I get weird results – either it just breaks on the end of each row (basically adding an empty row between every full paragraph entry) or it replaces all words that aren't directly adjacent to a character like "-" or "." with a blank paragraph break.
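
A possible reason for the blank rows, sketched under the assumption that the pattern was handed to split(): split() treats its pattern as the separator to throw away, so a pattern that matches the words themselves discards them and leaves only the gaps. A global match in list context returns the matched text instead, though it only yields non-overlapping pairs. The sub names here are made up for illustration.

```perl
use strict;
use warnings;

# split() discards whatever its pattern matches, so matching the word
# pairs themselves leaves only the empty and whitespace-only leftovers.
sub split_on_pairs {
    my ($line) = @_;
    return split /\w+\s\w+/, $line;
}

# A global match in list context returns the captured text instead,
# but each match starts where the previous one ended (no overlap).
sub match_pairs {
    my ($line) = @_;
    return $line =~ /(\w+\s\w+)/g;
}

my $line = "i posted a thread";
print join('|', map { "[$_]" } split_on_pairs($line)), "\n";  # [] and [ ]
print join('|', map { "[$_]" } match_pairs($line)), "\n";     # [i posted] and [a thread]
```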

I'm also not sure how to get it to 'fall back' after it's done a pair. So for example, if it gets "I posted" how am I going to get it to give me "posted a" instead of moving on to the 'next' pair, "a thread"...
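
One way the 'fall back' could work, as a rough sketch rather than the definitive answer: split the line into words once, then slide a start index along the array one word at a time and take a slice of the wanted length from each position. The helper name ngrams and the sample sentence are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of $n adjacent words.
# The start index advances by one each time, so runs overlap.
sub ngrams {
    my ($n, @words) = @_;
    my @out;
    # The range is empty when there are fewer than $n words.
    for my $start (0 .. $#words - $n + 1) {
        push @out, join ' ', @words[$start .. $start + $n - 1];
    }
    return @out;
}

my @words = split ' ', lc "I posted a thread on this topic";
print "$_\n" for ngrams(2, @words);   # pairs
print "$_\n" for ngrams(3, @words);   # triples
```

Because the window length is a parameter, the same loop covers single words, pairs, triples and so on.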

Any Perl expert wisdom?

ProvostZak
Dec 31, 2000

it's a huge step
Okay, I've been trying to get the code Triple Tech and Speed Frog suggested to work but I just cannot get my head around it. I guess the main problem is I'm not entirely sure what I'm doing with the code that works. This is what I understand of the code I have:

code:
$filename = "samplefile.txt";
open ( FILE, $filename) or die "Cannot open file: $!";

 while ( $data = <FILE> ) {
This opens the file and creates a loop in which each line of the file is fed into $data

code:
$data =~ tr/A-Z/a-z/;
This puts everything into lowercase, but can be replaced with $data = lc($data); for the same effect.

code:
@values = split(/\s/, $data);
This takes the contents of $data, breaks it at each whitespace character, and feeds the pieces into an array.
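
A small sketch of one thing to watch with this split, using made-up sub names: /\s/ matches a single whitespace character, so two spaces in a row produce an empty element, whereas the special pattern ' ' splits on runs of whitespace and skips leading whitespace.

```perl
use strict;
use warnings;

# /\s/ splits at every single whitespace character.
sub split_single {
    my ($line) = @_;
    return split /\s/, $line;
}

# ' ' is a special case: runs of whitespace count as one separator
# and leading whitespace is ignored.
sub split_runs {
    my ($line) = @_;
    return split ' ', $line;
}

my $line = "a  b\n";   # two spaces between the words
print scalar(() = split_single($line)), "\n";   # 3 elements: "a", "", "b"
print scalar(() = split_runs($line)), "\n";     # 2 elements: "a", "b"
```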

code:
foreach $val (@values)
 {
       open (OUTFILE, '>>testfile.txt');
       print OUTFILE "$val\n";
       close (OUTFILE);
 }
This loop takes each element in the array @values, assigns it to $val, and appends it to testfile.txt followed by a line break.
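
As an aside, the output file does not have to be reopened for every word. A minimal sketch, assuming a lexical filehandle opened once and passed around; the sub name write_words is hypothetical, and an in-memory filehandle stands in for testfile.txt so the example touches no files.

```perl
use strict;
use warnings;

# Hypothetical helper: print one value per line to an already-open handle.
sub write_words {
    my ($out, @values) = @_;
    print {$out} "$_\n" for @values;
    return;
}

# Open once, before the loop. A reference to a scalar gives an
# in-memory file; a real script would use '>>', 'testfile.txt' here.
open my $out, '>', \my $buffer or die "Cannot open buffer: $!";
write_words($out, qw(i posted a thread));
close $out;
print $buffer;
```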

Then the code loops again back to the start of the 'while' loop, starting on the next row in the samplefile.

Okay, so the code you guys suggested is kind of swamping me. This is me trying to understand what is going on.

code:
my $i = 0;
while (1) {
This creates a loop and $i, setting it to zero.

code:
my ($double, $triple) = ($i + 1, $i + 2);
This creates two variables: $double and $triple and sets them to 1 and 2 to start with.

code:
$double < @words and push @doubles, [ @words[$i .. $double] ];
$triple < @words and push @triples, [ @words[$i .. $triple] ];
This does… the important bit. I think it checks if $double is less than the number assigned to the last element in the array @words and if it is then it adds the contents of the square brackets (which are the values in the array that have been assigned those numbers) onto the end of @doubles. In the case of $triple it takes the total of three values between $i and $triple.
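
To make the two moving parts above concrete, here is a sketch with a made-up sub name. In a numeric comparison @words gives the number of elements (one more than the last index), @words[$i .. $double] is a slice that copies those elements, and the square brackets around it build a reference to a new anonymous array holding them.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the guarded push: return a reference
# to the pair starting at index $i, or nothing if it would run off the end.
sub pair_ref_at {
    my ($i, @words) = @_;
    my $double = $i + 1;
    return unless $double < @words;      # @words here is the element count
    return [ @words[$i .. $double] ];    # slice, wrapped in an array reference
}

my @words = qw(i posted a thread);
my $pair  = pair_ref_at(0, @words);
print ref($pair), "\n";      # ARRAY
print "@$pair\n";            # i posted
```

What ends up in @doubles is therefore a list of references, not a list of strings.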

code:
last if ++$i > @words;
}
This kicks us out of the while loop if incrementing $i would mean we had run out of elements in our array @words. If not it just increments $i (I think…)
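
A quick way to check that reading is to count the passes. In this sketch (the sub name is invented) ++$i always increments first and the comparison uses the new value, so with four words the loop body runs five times, with $i going from 0 to 4, before the test succeeds.

```perl
use strict;
use warnings;

# Hypothetical helper: run the same loop skeleton and report how many
# times the body ran and where $i ended up.
sub count_passes {
    my (@words) = @_;
    my ($i, $passes) = (0, 0);
    while (1) {
        $passes++;
        last if ++$i > @words;   # increment happens on every pass
    }
    return ($passes, $i);
}

my ($passes, $final_i) = count_passes(qw(i posted a thread));
print "$passes passes, i ends at $final_i\n";   # 5 passes, i ends at 5
```

The extra passes are harmless in the suggested code because the `$double < @words` and `$triple < @words` checks stop anything being pushed once the window runs past the end.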

So I tried the following code:

code:
$filename = "sample of no carriage returns and no blank entries.txt";
open (FILE, $filename) or die "Cannot open file: $!";

 while ($data = <FILE>)

{

 @words = split(/\s/, $data);

 my $i = 0;
 while (1)
 {
 my $double = $i + 1;

 $double < @words and push @doubles, [ @words[$i .. $double] ];

 last if ++$i > @words;
 }

 foreach $val (@doubles)

  {
       open (OUTFILE, '>>test6splitfile.txt');
       print OUTFILE "$val\n";
       close (OUTFILE);
  }

}

 exit 0;
This gives me output like this:

ARRAY(0x8f602c)
ARRAY(0x8f6050)
ARRAY(0x8f6080)
ARRAY(0x8f60b0)
ARRAY(0x8f60e0)
ARRAY(0x8f6110)
ARRAY(0x8f6140)
Etc. etc. etc.

I'm at a loss :(
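
For what it's worth, output like ARRAY(0x8f602c) is what Perl prints when an array reference is interpolated into a string. Each element of @doubles is a reference created by the square brackets in the push, so it has to be dereferenced before printing. A minimal sketch, with a made-up sub name:

```perl
use strict;
use warnings;

# Hypothetical helper: turn each array reference into a
# space-separated string by dereferencing it with @$_.
sub format_pairs {
    my (@pairs) = @_;
    return map { join ' ', @$_ } @pairs;
}

my @doubles = ([qw(i posted)], [qw(posted a)], [qw(a thread)]);
print "$_\n" for format_pairs(@doubles);
```

In the script above that would mean printing join(' ', @$val) instead of "$val". It also looks as though @doubles is never emptied between rows, so each row's pairs would be printed again for every later row; declaring it with my inside the while loop is one way to avoid that.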
