Mithaldu
Sep 25, 2007

Let's cuddle. :3:


FreshFeesh posted:

I'm trying to iterate through an alphabetized list and edit/remove entries based on particular criteria, but as I scale up the code I want to make sure I'm being as clean as possible, since I haven't touched this stuff in a while.

code:
@array = ('cat', 'dog 1', 'dog 2', 'dog 3', 'duck');

Desired output:    cat | dog 1, 2, 3 | duck
Here's my code which does the job, but looks fairly inefficient to me:
code:
for my $i (0 .. $#array) {
	$array[$i-1] =~ m/(\w+)\b/;
	if ($array[$i] =~ m/^$1\b/) {
		$array[$i] =~ m/(\d+)$/;
		$output[-1] .= ", $1";
	}
	else { push(@output, $array[$i]); }
}
It outputs correctly, but I'd be grateful for any suggestions you guys have for ways I can clean it up, so when I scale it out to a more complicated list it's easier to manage.

Do you actually care about whether these things are in sequence? If not, you can just split the string on a space, use the first value as a key, and dump things into a hash of arrayrefs.
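
Something like this, say (an untested sketch against your example data; the names here are made up):
code:
use strict;
use warnings;

my @array = ('cat', 'dog 1', 'dog 2', 'dog 3', 'duck');

# Group numbered entries under their first word; plain entries keep an empty list.
my %groups;
for my $entry (@array) {
    my ($name, $num) = split / /, $entry, 2;
    push @{ $groups{$name} }, $num if defined $num;
    $groups{$name} //= [];
}

# Rebuild the summary: cat | dog 1, 2, 3 | duck
print join(" | ", map {
    my @nums = @{ $groups{$_} };
    @nums ? "$_ " . join(", ", @nums) : $_;
} sort keys %groups), "\n";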


FreshFeesh
Jun 3, 2007

Drum Solo
I considered that, but unfortunately my data is more akin to
code:
Favorite Food: Spaghetti
Favorite Food: Tofu Lasagna
Green Beans 1
Green Beans 2
I like monkeys
Favorite Drink: Coors
Which makes splitting on whitespace problematic. I ended up with two different tests: one for lines ending in a number, and another for specific entries that use a colon as a delimiter:
code:
for my $i (0 .. $#newout) {
	if ($newout[$i] =~ m/\d+$/) {
		$newout[$i-1] =~ m/^([a-zA-Z\s]+)\d*/;
		if ($newout[$i] =~ m/^$1/) {
			$newout[$i] =~ m/(\d+)$/;
			$output[-1] .= ", $1";
		} else { 
			push(@output, $newout[$i]); 
		}
	}
	elsif ($newout[$i] =~ m/^Favorite Drink:/ or $newout[$i] =~ m/^Favorite Food:/ or $newout[$i] =~ m/^Something Else:/ or $newout[$i] =~ m/^Dance Party:/) {
		$newout[$i-1] =~ m/^([\w\s]+)/;
		if ($newout[$i] =~ m/^$1/) {
			$newout[$i] =~ m/: ([\w\s]+)$/;
			$output[-1] .= ", $1";
		} else { 
			push(@output, $newout[$i]); 
		}
	}
	else { 
		push(@output, $newout[$i]); 
	}
}
(I know my variable names are crap)


Mithaldu
Sep 25, 2007

Let's cuddle. :3:
Seriously, do fix those var names, they're the main thing making this hard to read. Also, you can condense this:

elsif ($newout[$i] =~ m/^Favorite Drink:/ or $newout[$i] =~ m/^Favorite Food:/ or $newout[$i] =~ m/^Something Else:/ or $newout[$i] =~ m/^Dance Party:/) {

to

elsif ($newout[$i] =~ m/^(Favorite Drink|Favorite Food|Something Else|Dance Party):/) {

Also add a next; after each operation where you consider the line concluded, and you can get rid of a lot of the control structure.


code:
my ( @summary, @lines );    # @lines holds the input lines
for my $i ( 0 .. $#lines ) {
    my $line      = $lines[$i];
    my $prev_line = $lines[ $i - 1 ];
    if ( $line =~ m/\d+$/ ) {
        $prev_line =~ m/^([a-zA-Z\s]+)\d*/;
        if ( $line =~ m/^$1/ ) {
            $line =~ m/(\d+)$/;
            $summary[-1] .= ", $1";
            next;
        }
        push @summary, $line;
        next;
    }

    if ( $line =~ m/^(Favorite Drink|Favorite Food|Something Else|Dance Party):/ ) {
        $prev_line =~ m/^([\w\s]+)/;
        if ( $line =~ m/^$1/ ) {
            $line =~ m/: ([\w\s]+)$/;
            $summary[-1] .= ", $1";
            next;
        }
        push @summary, $line;
        next;
    }

    push @summary, $line;
}

FreshFeesh
Jun 3, 2007

Drum Solo
Thank you for the suggestions, they're certainly helping and things look a lot better.

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church
I'm working with large 200-500 GB text files, and I want to find lines that match a pre-defined list of line IDs and return the line following each match. Here's my starting code:

code:
my %ids = ("foo",undef, "bar",undef, "baz",undef );
open(INPUT, "big.txt");
while (my $line = <INPUT>) {
	my $next = <INPUT>;
	if (defined $ids{$line}) {
		$ids{$line} = $next;
	}
}
When I run this code, according to iostat, my bottleneck is CPU (roughly 80-90% CPU is user, only 0.10% is iowait). This makes sense, because the system's bandwidth is 200-300 Gb/s, so I/O shouldn't be an issue. My machine has 16 CPUs, so I suspect that if I ran 16 threads, each reading a different chunk of the data, it would greatly speed up this code. My idea would be to have each thread seek to one of 16 positions in the file and start processing from there. I haven't worked out the details (I need to figure out what happens when a thread doesn't start in the right location, since the positions are just estimates), but I wanted to know what other people thought, and whether they had any alternative suggestions.

Ellie Crabcakes
Feb 1, 2008

Stop emailing my boyfriend Gay Crungus

octobernight posted:

I'm working with large 200-500 GB text files, and I want to find lines that match a pre-defined list of line IDs and return the line following each match. Here's my starting code:

code:
my %ids = ("foo",undef, "bar",undef, "baz",undef );
open(INPUT, "big.txt");
while (my $line = <INPUT>) {
	my $next = <INPUT>;
	if (defined $ids{$line}) {
		$ids{$line} = $next;
	}
}
When I run this code, according to iostat, my bottleneck is CPU (roughly 80-90% CPU is user, only 0.10% is iowait). This makes sense, because the system's bandwidth is 200-300 Gb/s, so I/O shouldn't be an issue. My machine has 16 CPUs, so I suspect that if I ran 16 threads, each reading a different chunk of the data, it would greatly speed up this code. My idea would be to have each thread seek to one of 16 positions in the file and start processing from there. I haven't worked out the details (I need to figure out what happens when a thread doesn't start in the right location, since the positions are just estimates), but I wanted to know what other people thought, and whether they had any alternative suggestions.
This code won't work.

1) It won't match the ids unless you chomp the line first.
2) defined $ids{foo} is always going to return false, because the value of $ids{foo} is set to undef. What you want is exists $ids{foo}.
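
Putting both fixes together, the loop would look roughly like this (a sketch keeping the original two-lines-at-a-time structure):
code:
use strict;
use warnings;

my %ids = map { $_ => undef } qw(foo bar baz);

open my $in, '<', 'big.txt' or die "Can't open big.txt: $!";
while (my $line = <$in>) {
    chomp $line;                  # strip the newline so the hash lookup can match
    my $next = <$in>;
    $ids{$line} = $next if exists $ids{$line};
}
close $in;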

Unless the id line and the content line have fixed widths--and I imagine they do not--then threads aren't going to help. First off, you won't know how many threads to spawn because you have to read through the entire file to get a line count. Then, each thread is going to have to read lines and throw them away until they get to their starting point.

homercles
Feb 14, 2010

You can make threads work, certainly you can. (Well, I'd use Parallel::ForkManager, mostly because true threads aren't needed.)

Is there a way to differentiate the key lines from the value lines? Do keys match a certain pattern? If you seek to a part of the file, assume you're mid-line, read in the current line as junk, and then scan in the next line, can you determine via a regex whether that line is a key or a value? If so, this is the easiest case: each child can run independently of the parent.

Otherwise, just have each thread assume that either one of the following are true, the first line in its slice is a key and subsequent is a value, OR that the first line is a value and subsequent lines form key/value pairs. When done, send both results to the parent, and the number of lines processed. Once the parent has all processed results from its children, it can reconstruct whether the first line each slice processed was a key or a value (based on the number of scanned in lines each child sends to the parent) and choose the processed results accordingly. It might be space prohibitive and it's going to be a bit pesky to write, but it's conceptually simple.
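
A bare skeleton of that fork-per-slice idea with Parallel::ForkManager, for reference; the filename, the worker count, and the naive boundary handling are all assumptions here, not anything from the posts above:
code:
use strict;
use warnings;
use List::Util qw(min);
use Parallel::ForkManager;

my $file    = 'big.txt';
my $workers = 16;
my %ids     = map { $_ => undef } qw(foo bar baz);

# Scan one byte range: skip the (probably partial) first line, then record
# the line following each matching id line. Boundary handling is naive.
sub process_chunk {
    my ($start, $end) = @_;
    my %found;
    open my $fh, '<', $file or die "Can't open $file: $!";
    seek $fh, $start, 0 or die "seek: $!";
    <$fh> if $start;
    while (my $line = <$fh>) {
        last if tell($fh) > $end;
        chomp $line;
        $found{$line} = scalar <$fh> if exists $ids{$line};
    }
    return \%found;
}

my $size  = -s $file;
my $chunk = int($size / $workers) + 1;
my $pm    = Parallel::ForkManager->new($workers);

# Merge each child's partial results back into the parent's %ids.
$pm->run_on_finish(sub {
    my ($pid, $exit, $ident, $signal, $core, $data) = @_;
    @ids{ keys %$data } = values %$data if $data;
});

for my $i (0 .. $workers - 1) {
    $pm->start and next;    # parent keeps looping; the child falls through
    my $found = process_chunk($i * $chunk, min(($i + 1) * $chunk, $size));
    $pm->finish(0, $found); # shipped back to the parent, serialized
}
$pm->wait_all_children;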

Alternately, maybe you can use gnu parallel for this.

A test program: perl -e 'BEGIN { $x = "x"x60 } print "key $_ $x\nval $_ $x\n" for 1 .. 214' | parallel --no-notice --block 500 --pipe -L 100 egrep --line-buffered -A1 -n "'^key .*7 '"

How you'd use it: cat BIGFILE.txt | parallel --no-notice --pipe -L 50000000 fgrep --line-buffered -A1 -x -f PATTERNSFILE.txt

You would have to parse the output to make sure matched keys are on 'odd' lines; GNU Parallel will limit the maximum number of lines sent to each fgrep, forcing each new parallel block to start on a clean, guaranteed key line. GNU Parallel is also written in Perl, so it might not be much of a performance improvement, though.

Mithaldu
Sep 25, 2007

Let's cuddle. :3:
There's one thing you need to be aware of:

Threads in Perl were created by ActiveState, on a contract from Microsoft, to provide fork emulation for Perl on Windows.

This means whenever you start a thread in Perl you actually start a hacky fork emulator, with the consequences being:

- thread startup is super expensive, start as few as you can and reuse
- thread startup duplicates the memory use of the mother thread; so start threads as early as possible, with as few things loaded as possible
- thread data is not shared by default, you need to mark every variable you wish to share
- threads are unreliable and break easily, try to use multiple processes and/or native fork first
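
A minimal example of the sharing point (third bullet), since it trips people up:
code:
use strict;
use warnings;
use threads;
use threads::shared;

my %results :shared;    # without :shared, each thread gets its own copy

my @workers = map {
    my $n = $_;
    threads->create(sub {
        lock %results;    # serialize writes from different threads
        $results{"worker_$n"} = $n * $n;
    });
} 1 .. 4;

$_->join for @workers;
print "$_ => $results{$_}\n" for sort keys %results;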

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church
Thanks for the replies, all.

John Big Booty posted:

This code won't work.

1) It won't match the ids unless you chomp the line first.
2) defined $ids{foo} is always going to return false, because the value of $ids{foo} is set to undef. What you want is exists $ids{foo}.

Unless the id line and the content line have fixed widths--and I imagine they do not--then threads aren't going to help. First off, you won't know how many threads to spawn because you have to read through the entire file to get a line count. Then, each thread is going to have to read lines and throw them away until they get to their starting point.

Thanks! I meant to use exists, not defined, and I left out the code that trims whitespace from the ends of each line. I wasn't planning on starting exactly at a particular line; rather, I was going to jump to byte location XXXX for each thread. I would then throw away whatever partial line it read and find the first complete line that has the line-start identifier (in my file, the id format is $id). It was something off the top of my head, though.

homercles posted:

You can make threads work, certainly you can. (Well, I'd use Parallel::ForkManager, mostly because true threads aren't needed.)

Is there a way to differentiate the key lines from the value lines? Do keys match a certain pattern? If you seek to a part of the file, assume you're mid-line, read in the current line as junk, and then scan in the next line, can you determine via a regex whether that line is a key or a value? If so, this is the easiest case: each child can run independently of the parent.

Yes, there is. ID lines start with "$", which makes it easy to figure out whether a line is an id line or a value line. I'll probably try my seek strategy, then.

quote:

Alternately, maybe you can use gnu parallel for this.

A test program: perl -e 'BEGIN { $x = "x"x60 } print "key $_ $x\nval $_ $x\n" for 1 .. 214' | parallel --no-notice --block 500 --pipe -L 100 egrep --line-buffered -A1 -n "'^key .*7 '"

How you'd use it: cat BIGFILE.txt | parallel --no-notice --pipe -L 50000000 fgrep --line-buffered -A1 -x -f PATTERNSFILE.txt

You would have to parse the output to make sure matched keys are on 'odd' lines; GNU Parallel will limit the maximum number of lines sent to each fgrep, forcing each new parallel block to start on a clean, guaranteed key line. GNU Parallel is also written in Perl, so it might not be much of a performance improvement, though.

Thanks for this, too. I've never used GNU parallel before, but I'm reading up on it now.


Mithaldu posted:

There's one thing you need to be aware of:

Threads in Perl were created by ActiveState, on a contract from Microsoft, to provide fork emulation for Perl on Windows.

This means whenever you start a thread in Perl you actually start a hacky fork emulator, with the consequences being:

- thread startup is super expensive, start as few as you can and reuse
- thread startup duplicates the memory use of the mother thread; so start threads as early as possible, with as few things loaded as possible
- thread data is not shared by default, you need to mark every variable you wish to share
- threads are unreliable and break easily, try to use multiple processes and/or native fork first

Thanks for the advice. I did not know that threads duplicate the memory of the mother thread. I don't think this will be an issue, since the mother thread shouldn't be using much memory, but I will keep an eye on it, as well as on any unexpected behavior from threads.

Ellie Crabcakes
Feb 1, 2008

Stop emailing my boyfriend Gay Crungus

octobernight posted:

Thanks! I meant to use exists, not defined, and I left out the code that trims the lines so that all whitespace is removed from the ends. I wasn't planning on start exactly at a particular line, rather I was going to jump to XXXX byte location for each thread. I would then throw away whatever line it read it and find the first complete line that has the line start identifier (in my file, the id format is $id). It was something off the top of my head, though.
With a bit of massaging, that could be a workable approach.

Instead of having the threads jump in at a random location and discard lines, stat the file and divide the size by the number of threads. Seek to $chunksize*$i and read until you hit a "$" preceded by a newline. Feed those offsets into the worker threads, and you won't have to discard any lines, or deal with the almost-guaranteed mangled data that would result.
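
A sketch of that offset scan, assuming the "$"-prefixed id lines described earlier (the filename and thread count are placeholders):
code:
use strict;
use warnings;

my $file    = 'big.txt';
my $workers = 16;

my $size  = -s $file;    # stat the file
my $chunk = int($size / $workers);

open my $fh, '<', $file or die "Can't open $file: $!";
my @offsets = (0);
for my $i (1 .. $workers - 1) {
    seek $fh, $i * $chunk, 0 or die "seek: $!";
    <$fh>;               # skip ahead to the next line boundary
    my $pos = tell $fh;
    while (my $line = <$fh>) {    # advance to the next '$' id line
        last if $line =~ /^\$/;
        $pos = tell $fh;
    }
    push @offsets, $pos;
}
close $fh;
# Worker $i then handles bytes $offsets[$i] up to $offsets[$i + 1] (or EOF).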

That said, this whole setup seems a lot less than ideal. What sort of data is in these files and how are they generated? What happens to the data pulled out?

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church

John Big Booty posted:

With a bit of massaging, that could be a workable approach.

Instead of having the threads jump in at a random location and discard lines, stat the file and divide the size by the number of threads. Seek to $chunksize*$i and read until you hit a "$" preceded by a newline. Feed those offsets into the worker threads, and you won't have to discard any lines, or deal with the almost-guaranteed mangled data that would result.

That said, this whole setup seems a lot less than ideal. What sort of data is in these files and how are they generated? What happens to the data pulled out?

Thanks for the suggestion. I'll be testing this implementation today. I greatly simplified the data description to get at a general algorithm, but basically I'm working with next-generation sequencing data, and I'm trying to find specific reads in a file that I've identified through a different analysis. Basically, I have a huge sequencing file; I run the whole file through a program that identifies key sequences (but outputs only their names, not the original sequences), so I need to pull the raw reads from the file, which are then fed to another program.

I've never worked with sequencing data before, nor am I too familiar with multi-threaded/multi-core programming, but I need to pick this up quickly given the amount of data I'm processing (~hundreds of TBs).

Joe Chip
Jan 4, 2014
Is there a good guide to the changes from 5.14 to 5.24? Googling hasn't been helpful, and my copy of Programming Perl might never be updated again, so I want to know what I'm doing wrong.

homercles
Feb 14, 2010

perldelta is the be-all and end-all of what changed: http://perldoc.perl.org/index-history.html

As to the specifics, I tend not to delve too much into them personally; I still have to work on environments stuck in 5.10.0 (not that I'm complaining; it still feels quite modern, from before smartmatch was butchered). Perhaps a kinder soul (Mithaldu?) could delve into the big things that changed.

Mithaldu
Sep 25, 2007

Let's cuddle. :3:
To be honest, I'm not entirely sure much world-shifting stuff happened?

Postfix deref is amazing, if you know you'll be working on new Perls.

Autoderef is dead and gone.

Other than that, I'm not really aware of much that I can actually use.
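
For anyone who hasn't seen it, postfix deref reads like this (stable as of 5.24; behind an experimental feature flag on 5.20/5.22):
code:
use strict;
use warnings;
use v5.24;    # postfix deref is stable from 5.24 on

my $aref = [1, 2, 3];
my $href = { a => 1, b => 2 };

my @list  = $aref->@*;         # same as @{ $aref }
my %hash  = $href->%*;         # same as %{ $href }
my @slice = $aref->@[0, 1];    # same as @{ $aref }[0, 1]

say "@list";    # 1 2 3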

Sebbe
Feb 29, 2004

Mithaldu posted:

To be honest, I'm not entirely sure much world-shifting stuff happened?

Postfix deref is amazing, if you know you'll be working on new Perls.

Autoderef is dead and gone.

Other than that, I'm not really aware of much that I can actually use.

Subroutine signatures are pretty nice, too.

And there's also the double diamond operator, which is worth knowing about, I suppose.
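
Small examples of both, for the curious (signatures are still marked experimental in 5.24; <<>> arrived in 5.22):
code:
use strict;
use warnings;
use feature qw(say signatures);
no warnings 'experimental::signatures';

# Signatures replace the usual "my ($x, $y) = @_;" unpacking.
sub greet ($name, $greeting = 'Hello') {
    return "$greeting, $name!";
}
say greet('Perl');    # Hello, Perl!

# <<>> works like <>, but treats every @ARGV entry strictly as a
# filename, so 2-arg-open surprises like "|cmd" or ">file" can't happen.
while (my $line = <<>>) {
    print $line;
}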

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church
I'm going nuts over the most bizarre issue reading gzipped files with the PerlIO-gzip-0.19 module. I have a gzip file that's about 30 GB in size. I read 4 lines at a time from it and print out the total number of 4-line records read, and I get 46,285,997. However, if I decompress the file from the command line, I see that there are actually 462,843,365 4-line records in the file. How am I off by an order of magnitude? Here's my code:

code:
open INPUT,"<:gzip",$input;

while (my $line1 = <INPUT>) {
    $counts++;
    my $line2 = <INPUT>;
    my $line3 = <INPUT>;
    my $line4 = <INPUT>;

    if ($counts % 1000000 == 0) {
      print "$counts\n";
    }
}
What's really stupid is that if I run the exact same pseudocode in Python, it reads the file correctly. Is there something in my code that is causing unexpected behavior? It always stops in the same place ($counts == 46285997 when the loop terminates). Could the gzip have some sort of EOF character at that line that Python just "mishandles" correctly?

Mithaldu
Sep 25, 2007

Let's cuddle. :3:
That looks like an early stop. Make sure you verify the error conditions of all the system calls you're doing after you do them.

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church

Mithaldu posted:

That looks like an early stop. Make sure you verify the error conditions of all the system calls you're doing after you do them.

By early stop, do you mean my script terminates incorrectly? As far as I can tell, that's not the case, because I have a print statement outside the loop that shows the loop completed. The only other thing I can think of is: is it possible that my while condition is incorrect?

code:
while (my $fline = <FORWARD>) ....
Is that an incorrect way to read a file line by line in Perl? If $fline is a blank line, will it terminate the loop? I had always thought the only way to terminate is if $fline is undefined, but if a blank line will cause it to terminate, maybe that's what's happening? I just need to open the file and check whether line #185143988+ contains something that is causing it to terminate early.

I should've done this from the beginning. The error happens across different file systems, always on the same line. I was too lazy to actually open the file and look at that line to see if there's something wrong with it. If there is (i.e., a blank line throwing off the reader), that would really suck, because we would need to re-verify terabytes of data.

I do recall that the person who originally generated the data had issues putting the original file together (a bunch of separate gzip files that they recombined in parallel into a single file). If they inserted some unexpected blank line in there, that's going to screw up everything.

Thanks!

Mithaldu
Sep 25, 2007

Let's cuddle. :3:
See http://perldoc.perl.org/functions/readline.html for more detail. But basically, readline() doesn't throw exceptions; it sets an error variable. So if readline errors out, your while loop ends because it returned undef. The linked page also shows safer ways to do this.
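
(On the blank-line question above: a blank line won't end the loop, because Perl implicitly rewrites while (my $line = <$fh>) as while (defined(my $line = <$fh>)).) The pattern from that perldoc page, adapted here to a gzip handle (the filename is made up):
code:
use strict;
use warnings;
use PerlIO::gzip;

open my $fh, '<:gzip', 'reads.txt.gz' or die "open: $!";
while (1) {
    undef $!;    # clear errno so a stale value can't lie to us below
    my $line = readline $fh;
    unless (defined $line) {
        die "readline failed: $!" if $!;    # a real error, not end-of-file
        last;                               # clean EOF
    }
    # ... process $line ...
}
close $fh or die "close: $!";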

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church

Mithaldu posted:

See http://perldoc.perl.org/functions/readline.html for more detail. But basically, readline() doesn't throw exceptions; it sets an error variable. So if readline errors out, your while loop ends because it returned undef. The linked page also shows safer ways to do this.

Thank you! That's super helpful. I'll make sure to update all my scripts to follow these practices.

Mithaldu
Sep 25, 2007

Let's cuddle. :3:
Cheers and good luck with that.

For what it's worth: modules like autodie and IO::All also hide a lot of this "you have to do X to do it safely" busywork behind convenient automation. :)

Hughmoris
Apr 21, 2007
Let's go to the abyss!
The hospital I work for is trying to be more resource-efficient and to reuse or recycle our office furniture. I learned today that they don't have any sort of inventory management system for the used furniture they receive, nor do they keep track of what they give back out or recycle. They also don't have a way for users to browse what used furniture is available to be redeployed.

That got me thinking of a possible project. I have no expectations of actually deploying it, but it seems like it would be great for learning. I'm thinking I'd learn a little bit about databases to catalog the furniture, and about web frameworks to let users see and choose items.

From a design perspective, when tackling a slightly larger project like this, where do I start? Should I start by picking a database and building that up, then go from there? Or do I start by learning a web framework like Dancer2 or Mojolicious?

I really want to move beyond writing simple scripts that parse CSVs and the like, but I always stall out in the design phase of a larger program, and I don't know anyone IRL who programs that I can bounce ideas off of.

Mithaldu
Sep 25, 2007

Let's cuddle. :3:
As far as databases go, unless you're getting a ready-made database structure provided, just use Postgres. Everything else is either poo poo, insane, or too expensive. And if the framework you end up with doesn't provide an ORM, use DBIx::Class.

As far as frameworks go: Dancer2 if you expect it to stay very small even 10 years down the road; Mojolicious if you don't mind it being experimental bleeding edge and breaking easily, or if you expect trouble with managing dependencies (there should be none if you just use carton/local::lib); or you just go with Catalyst, because it scales well, is stable, and is well documented.
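
For a sense of scale, the canonical Mojolicious::Lite starting point is a one-file app like this (a sketch, obviously not a furniture tracker yet):
code:
use Mojolicious::Lite;

# GET / returns a plain-text greeting.
get '/' => sub {
    my $c = shift;
    $c->render(text => 'Hello, inventory!');
};

app->start;    # run with: perl app.pl daemon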


Hughmoris
Apr 21, 2007
Let's go to the abyss!

Mithaldu posted:

As far as databases go, unless you're getting a ready-made database structure provided, just use Postgres. Everything else is either poo poo, insane, or too expensive. And if the framework you end up with doesn't provide an ORM, use DBIx::Class.

As far as frameworks go: Dancer2 if you expect it to stay very small even 10 years down the road; Mojolicious if you don't mind it being experimental bleeding edge and breaking easily, or if you expect trouble with managing dependencies (there should be none if you just use carton/local::lib); or you just go with Catalyst, because it scales well, is stable, and is well documented.

Thanks for the information, I'll see what I can get in to!
