|
FreshFeesh posted:
I'm trying to iterate through an alphabetized list and edit/remove entries based on particular criteria, but as I scale up the code I want to make sure I'm being as clean as possible, since I haven't touched this stuff in a while.

Do you actually care about whether these things are in sequence? If not, you can just split the string on a space, use the first value as a key, and dump things into a hash of arrayrefs.
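A minimal sketch of that hash-of-arrayrefs idea (the entry strings here are invented for illustration; the real data presumably looks different):

```perl
use strict;
use warnings;

# Hypothetical entries standing in for the poster's alphabetized list.
my @entries = (
    'Alice likes tea',
    'Alice likes scones',
    'Bob likes coffee',
);

my %by_key;
for my $entry (@entries) {
    # Split on the first space only: first field is the key, rest is the value.
    my ($key, $rest) = split / /, $entry, 2;
    push @{ $by_key{$key} }, $rest;
}

# %by_key is now:
#   Alice => ['likes tea', 'likes scones'],
#   Bob   => ['likes coffee'],
```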
|
# ? Jul 24, 2016 12:29 |
|
|
# ? Apr 25, 2024 18:29 |
|
I considered that, but unfortunately my data is more akin to:
code:
FreshFeesh fucked around with this message at 13:09 on Jul 24, 2016 |
# ? Jul 24, 2016 13:04 |
|
Seriously, do fix those var names; they're the main thing making this hard to read.

Also, you can condense this:

code:
elsif ($newout[$i] =~ m/^Favorite Drink:/ or $newout[$i] =~ m/^Favorite Food:/ or $newout[$i] =~ m/^Something Else:/ or $newout[$i] =~ m/^Dance Party:/) {
to:

code:
elsif ($newout[$i] =~ m/^(Favorite Drink|Favorite Food|Something Else|Dance Party):/) {
Also, add a next; after each operation where you consider the line concluded, and you can get rid of a lot of the control structure.

code:
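As a sketch of the advice above (the array contents are invented; @newout is the name from the post), the condensed alternation plus next; flattens the elsif cascade into independent checks:

```perl
use strict;
use warnings;

# Made-up sample lines; the real file's contents aren't shown in the thread.
my @newout = (
    'Favorite Drink: coffee',
    'Something Else: whatever',
    'Unrelated line',
);

my @matched;
for my $i (0 .. $#newout) {
    # One alternation instead of four or-ed matches.
    if ($newout[$i] =~ m/^(Favorite Drink|Favorite Food|Something Else|Dance Party):/) {
        push @matched, $newout[$i];
        next;    # line handled; skip everything below instead of chaining elsifs
    }
    # ...each further elsif becomes a plain if ending in next;...
}

print scalar(@matched), " lines matched\n";    # 2 lines matched
```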
|
# ? Jul 24, 2016 13:23 |
|
Thank you for the suggestions, they're certainly helping and things look a lot better.
|
# ? Jul 24, 2016 13:38 |
|
I'm working with large (200-500 GB) text files, and I want to find lines that match a pre-defined list of line IDs and return the line following each match. Here's my starting code:

code:
|
# ? Aug 1, 2016 20:48 |
|
octobernight posted:
I'm working with large 200-500 Gb text files, and I want to find lines that match a pre-defined list line ids and return the next line following the match. Here's my starting code:

1) It won't match the ids unless you chomp the line first.

2) defined $ids{foo} is always going to return false, because the value of $ids{foo} is set to undef. What you want is exists $ids{foo}.

Unless the id line and the content line have fixed widths--and I imagine they do not--then threads aren't going to help. First off, you won't know how many threads to spawn, because you have to read through the entire file to get a line count. Then each thread is going to have to read lines and throw them away until it gets to its starting point.
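The exists-vs-defined (and chomp) points in a runnable nutshell, with invented ids:

```perl
use strict;
use warnings;

# Keys loaded with undef values, as in the id lookup being discussed.
my %ids = map { $_ => undef } qw(id1 id2 id3);

# defined checks the *value*, which is undef here, so it's always false:
print defined $ids{id1} ? "defined\n" : "not defined\n";   # not defined

# exists checks whether the *key* is present, which is what we want:
print exists $ids{id1} ? "exists\n" : "missing\n";         # exists

# And the chomp point: a line read from a file keeps its newline,
# so "id1\n" is not the same hash key as "id1".
my $line = "id1\n";
chomp $line;
print exists $ids{$line} ? "found\n" : "not found\n";      # found
```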
|
# ? Aug 2, 2016 03:44 |
|
You can make threads work, certainly. (Well, I'd use Parallel::ForkManager, mostly because true threads aren't needed.)

Is there a way to differentiate the key lines from the value lines? Do keys match a certain pattern? If you seek to a part of the file, assume you're mid-line and read in the current line assuming it's junk, then scan in the next line, can you determine via a regex whether that line is a key or a value? If so, this is the easiest case: each child can run independent of the parent.

Otherwise, have each thread assume that one of the following is true: either the first line in its slice is a key and the next is a value, OR the first line is a value and subsequent lines form key/value pairs. When done, send both results to the parent, along with the number of lines processed. Once the parent has all processed results from its children, it can reconstruct whether the first line each slice processed was a key or a value (based on the number of scanned-in lines each child sent) and choose the processed results accordingly. It might be space prohibitive and it's going to be a bit pesky to write, but it's conceptually simple.

Alternately, maybe you can use GNU parallel for this. A test program:

code:
perl -e 'BEGIN { $x = "x"x60 } print "key $_ $x\nval $_ $x\n" for 1 .. 214' | parallel --no-notice --block 500 --pipe -L 100 egrep --line-buffered -A1 -n "'^key .*7 '"
How you'd use it:

code:
cat BIGFILE.txt | parallel --no-notice --pipe -L 50000000 fgrep --line-buffered -A1 -x -f PATTERNSFILE.txt
You would have to parse the output to make sure matched keys are on 'odd' lines; GNU Parallel will limit the maximum number of lines sent to fgrep, forcing each new parallel block to start on a clean, guaranteed key line. GNU Parallel is also written in Perl, so it might not be much of a performance improvement, though.
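The "one child per slice, each reporting its result plus a line count back to the parent" shape can be sketched with core fork and pipes (Parallel::ForkManager wraps this bookkeeping for you; the slices and sums here are invented stand-ins for file chunks):

```perl
use strict;
use warnings;

my @slices = ([1 .. 5], [6 .. 10]);    # stand-ins for slices of the file
my @readers;

for my $slice (@slices) {
    pipe(my $r, my $w) or die "pipe: $!";
    my $pid = fork;
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {                   # child: process the slice
        close $r;
        my $sum = 0;
        $sum += $_ for @$slice;
        # Send the result plus the number of lines processed to the parent.
        print {$w} "$sum ", scalar(@$slice), "\n";
        close $w;
        exit 0;
    }
    close $w;                          # parent keeps only the read end
    push @readers, $r;
}

my @totals;
for my $r (@readers) {
    my ($sum, $lines) = split ' ', scalar <$r>;
    push @totals, $sum;
    close $r;
}
wait for @readers;                     # reap both children
print "@totals\n";                     # 15 40
```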
|
# ? Aug 2, 2016 09:05 |
|
octobernight posted:
threads

Threads in Perl were created by ActiveState on a contract from Microsoft to create fork emulation in Perl on Windows. This means whenever you start a thread in Perl you actually start a hacky fork emulator, with the consequences being:

- thread startup is super expensive, so start as few as you can and reuse them
- thread startup duplicates the memory use of the mother thread, so start threads as early as possible, with as few things loaded as possible
- thread data is not shared by default; you need to mark every variable you wish to share
- threads are unreliable and break easily; try to use multiple processes and/or native fork first
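The "mark every variable you wish to share" point looks like this in practice (assumes a perl built with thread support):

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $counter : shared = 0;    # without :shared, each thread gets its own copy

my @threads = map {
    threads->create(sub {
        lock($counter);      # shared data still needs explicit locking
        $counter++;
    });
} 1 .. 4;

$_->join for @threads;
print "counter = $counter\n";    # counter = 4
```

Without the :shared attribute, each of the four fork-emulated threads would increment its own private copy and the parent would still see 0.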
|
# ? Aug 2, 2016 10:55 |
|
Thanks for the replies, all.

John Big Booty posted:
This code won't work.

Thanks! I meant to use exists, not defined, and I left out the code that trims the lines so that all whitespace is removed from the ends. I wasn't planning on starting exactly at a particular line; rather, I was going to jump to byte location XXXX for each thread. I would then throw away whatever line it read and find the first complete line that has the line-start identifier (in my file, the id format is $id). It was something off the top of my head, though.

homercles posted:
You can make threads work, certainly you can. (well I'd use Parallel::ForkManager mostly because true threads aren't needed)

Yes, there is. Id lines start with "$", so it's easy to figure out whether a line is an id line or a value line. I'll probably try my seek strategy then.

quote:
Alternately, maybe you can use gnu parallel for this.

Thanks for this, too. I've never used GNU parallel before, but I'm reading over it now.

Mithaldu posted:
There's one thing you need to be aware of:

Thanks for the advice. I did not know that threads duplicate the memory of the mother thread. I don't think this will be an issue, since the mother thread shouldn't be using much memory, but I will keep an eye on it, as well as on any unexpected behavior from threads.
|
# ? Aug 2, 2016 18:57 |
|
octobernight posted:
Thanks! I meant to use exists, not defined, and I left out the code that trims the lines so that all whitespace is removed from the ends. I wasn't planning on start exactly at a particular line, rather I was going to jump to XXXX byte location for each thread. I would then throw away whatever line it read it and find the first complete line that has the line start identifier (in my file, the id format is $id). It was something off the top of my head, though.

Instead of having the threads jump in at a random location and discard lines, stat the file and divide the size by the number of threads. Seek to $chunksize*$i and read until you hit a $ preceded by a newline. Feed those offsets into the worker threads and you won't have to discard any lines, or risk the almost-guaranteed mangled data that results from doing so.

That said, this whole setup seems a lot less than ideal. What sort of data is in these files, and how are they generated? What happens to the data pulled out?
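A sketch of that offset-splitting idea, demonstrated on a throwaway temp file of $id/value pairs (the id-on-every-other-line layout is assumed from the thread; adjust the /^\$/ test to the real format):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stat the file, divide its size by the worker count, then nudge each
# boundary forward to the next line beginning with '$' (the id marker),
# so every worker starts on a whole record.
sub chunk_offsets {
    my ($path, $workers) = @_;
    my $size = -s $path;
    open my $fh, '<', $path or die "open $path: $!";
    my @offsets = (0);
    for my $i (1 .. $workers - 1) {
        seek $fh, int($size * $i / $workers), 0 or die "seek: $!";
        scalar <$fh>;                  # assume we landed mid-line; discard it
        while (1) {
            my $pos  = tell $fh;
            my $line = <$fh>;
            last unless defined $line;
            if ($line =~ /^\$/) {      # found the next clean id line
                push @offsets, $pos;
                last;
            }
        }
    }
    close $fh;
    return @offsets;
}

# Demo with a small fake file of id/value pairs.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "\$id$_\nvalue$_\n" for 1 .. 100;
close $tmp;

my @offsets = chunk_offsets($path, 4);
print "@offsets\n";
```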
|
# ? Aug 3, 2016 01:40 |
|
John Big Booty posted:
With a bit of massaging, that could be a workable approach.

Thanks for the suggestion. I'll be testing this implementation today.

I greatly simplified the data description to get at a general algorithm, but basically I'm working with next-generation sequencing data, and I'm trying to find specific reads in a file that I've identified through a different analysis. Basically, I had a huge sequencing file, and I ran the whole file through a program that identifies key sequences (but reports only their names, not the original sequences). Thus, I need to pull the raw reads from the file, which are then fed to another program.

I've never worked with sequencing data before, nor am I too familiar with multi-threaded/multi-core programming, but it's a requirement that I pick this up quickly, due to the amount of data I'm processing (~hundreds of TBs).
|
# ? Aug 3, 2016 18:37 |
|
Is there a good guide for changes from 5.14 to 5.24? Googling hasn't been helpful and my copy of Programming Perl might never be updated again so I want to know what I'm doing wrong.
|
# ? Nov 5, 2016 03:31 |
|
perldelta is the be-all and end-all of what changed: http://perldoc.perl.org/index-history.html

As for the specifics, I tend not to delve into them much personally; I still have to work in environments stuck on 5.10.0 (not that I'm complaining, it still feels quite modern, dating from before smartmatch was butchered). Perhaps a kinder soul (Mithaldu?) could delve into the big things that changed.
|
# ? Nov 5, 2016 06:17 |
|
To be honest, I'm not entirely sure much world-shifting stuff happened?

- Postfix deref is amazing, if you know you'll only be running on new perls.
- Autoderef is dead and gone.

Other than that, I'm not really aware of much that I can actually use.
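For anyone who hasn't run into it yet, postfix deref on 5.24+ looks like this:

```perl
use strict;
use warnings;
use v5.24;    # postfix deref is stable from 5.24 onward

my $aref = [1, 2, 3];
my $href = { a => 1, b => 2 };

my @list = $aref->@*;            # same as @{ $aref }, reads left to right
my @keys = sort keys $href->%*;  # same as keys %{ $href }
my $last = $aref->$#*;           # same as $#{ $aref }

print "@list | @keys | $last\n";   # 1 2 3 | a b | 2
```

The win is that deref chains read left to right instead of inside out, which matters once you're several levels deep in a structure.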
|
# ? Nov 6, 2016 00:12 |
|
Mithaldu posted:
To be honest, i'm not entirely sure much world-shifting stuff happened?

Subroutine signatures are pretty nice, too. And there's also the double diamond operator, which is worth knowing about, I suppose.
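Both in brief (signatures were still experimental in this era, hence the warnings pragma; the greet sub is invented for illustration):

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

# Named parameters, with an optional default, instead of unpacking @_.
sub greet ($name, $greeting = 'Hello') {
    return "$greeting, $name!";
}

print greet('Perl'), "\n";          # Hello, Perl!
print greet('Perl', 'Hi'), "\n";    # Hi, Perl!

# The double diamond, new in 5.22, reads @ARGV files like <> but uses
# three-arg open internally, so a filename like "|rm -rf /" is treated
# as a literal name rather than executed:
#
#     while (my $line = <<>>) { ... }
```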
|
# ? Nov 9, 2016 10:40 |
|
I'm going nuts over the most bizarre issue reading gzipped files with the PerlIO-gzip-0.19 module. I have a gzip file that's about 30GB in size. I read 4 lines at a time from it and print out the total number of 4-line groups read, and I get 46,285,997. However, if I decompress the file from the command line, I see that there are actually 462,843,365 4-line groups in the file. How am I off by an order of magnitude? Here's my code:

code:
|
# ? Feb 7, 2017 17:25 |
|
That looks like an early stop. Make sure you verify the error conditions of all the system calls you're doing after you do them.
|
# ? Feb 7, 2017 17:28 |
|
Mithaldu posted:
That looks like an early stop. Make sure you verify the error conditions of all the system calls you're doing after you do them.

By early stop, do you mean my script terminates incorrectly? As far as I can tell, that's not the case, because I have a print statement outside the loop that shows it completed the loop. The only other thing I can think of is: is it possible that my while condition is incorrect?

code:
I should've done this from the beginning. The error happens across different file systems, always on the same line. I was too lazy to actually open the file and look at that line to see if there's something wrong with it. If there is (i.e., a blank line throwing off the reader), that would really suck, because we would need to re-verify terabytes of data. I do recall that the person who originally generated the data had issues putting the original file together (a bunch of separate gzip files that they recombined in parallel into a single file). If they inserted some unexpected blank line in there, that's going to screw up everything. Thanks!
|
# ? Feb 7, 2017 18:31 |
|
See http://perldoc.perl.org/functions/readline.html for more detail. But basically, readline() doesn't throw exceptions; it sets an error variable. So if readline errors out, your while loop ends because it returned undef, exactly as it does at end of file. The linked page also shows safer ways to do this.
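A sketch of that advice: distinguish "hit EOF" from "readline failed" after the loop. Reading this script itself ($0) keeps the example self-contained; for the gzip case you'd open with PerlIO::gzip instead, e.g. open my $fh, '<:gzip', $file.

```perl
use strict;
use warnings;
use IO::Handle;    # for the error() method on filehandles

open my $fh, '<', $0 or die "open $0: $!";

my $count = 0;
while (defined(my $line = <$fh>)) {
    $count++;
}

# The loop above ends on *any* undef from readline, so check why it ended:
if ($fh->error) {
    die "read error on $0: $!";    # readline failed mid-file
}
die "stopped before end of file" unless eof $fh;

print "read $count lines cleanly\n";
close $fh or die "close: $!";
```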
|
# ? Feb 7, 2017 18:39 |
|
Mithaldu posted:
See http://perldoc.perl.org/functions/readline.html for more detail. But basically, readline() doesn't throw exceptions. It sets an error variable. So if readline errors out your while loop ends because it returned something undef. The linked article also shows ways to do this safer.

Thank you! That's super helpful. I'll make sure to update all my scripts to follow these practices.
|
# ? Feb 7, 2017 18:44 |
|
Cheers and good luck with that. For what it's worth: Modules like autodie and IO::All also hide a lot of this "need to do it to do it safely" behind convenient automation.
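autodie (in core since 5.10.1) in a nutshell: lexically replaces the built-ins so failures die with a useful message instead of needing an `or die` after every call.

```perl
use strict;
use warnings;
use autodie;    # open/close/etc. now throw on failure in this scope

# Without autodie this would need: open(...) or die "can't open: $!";
open my $fh, '<', $0;    # dies with file name and reason if this fails
my @lines = <$fh>;
close $fh;               # close failures are caught too

print scalar(@lines), " lines\n";
```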
|
# ? Feb 7, 2017 23:08 |
|
The hospital I work for is trying to be more resource efficient and reuse or recycle our office furniture. I learned today that they don't have any sort of inventory management system for the used furniture that they receive, or keep track of what they give back out or recycle. They also don't have a way for users to browse through what used furniture is available to be redeployed.

That got me thinking of a possible project. I have no expectations of actually deploying it, but it seems like it would be great for learning. I'm thinking I'd learn a little bit about databases to catalog the furniture, and web frameworks to allow users to see and choose items.

From a design perspective, when tackling a little larger project like this, where do I start? Should I start with picking a database to use and build that up, then go from there? Do I start with learning a web framework like Dancer2 or Mojolicious? I really want to move beyond writing simple scripts that parse CSVs and the like, but I always stall out in the design phase of a larger program, and I don't know anyone IRL who programs that I can bounce ideas off of.
|
# ? Feb 18, 2017 01:33 |
|
As far as databases go, unless you're getting a ready-made database structure provided, just use Postgres. Everything else is either poo poo, insane, or too expensive. And if the framework you end up with doesn't provide an ORM, use DBIx::Class.

As far as frameworks go: Dancer2 if you expect the project to be very small even 10 years down the road; Mojolicious if you don't mind it being experimental bleeding edge and breaking easily, or if you expect trouble managing dependencies (there should be none if you just use carton/local::lib); or just go with Catalyst, because it scales well, is stable, and is well documented.
|
# ? Feb 18, 2017 10:14 |
|
|
|
Mithaldu posted:
As far as databases go, unless you're getting a ready-made database structure provided, just use Postgres. Everything else is either poo poo, insane, or too expensive. And if the framework you end up with doesn't provide an ORM, use DBIx::Class.

Thanks for the information, I'll see what I can get into!
|
# ? Feb 19, 2017 00:11 |