octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church
I have a question about how unix system commands work in perl. I have my PATH environment variable set to include the directory of a program I want to run. From the shell I can type:

> my_program
my_program executed

However, if I try the same thing in my perl script with this line:

print `my_program`;

I get the error message, "Can't exec "my_program": No such file or directory at ../test.pl line 11." Why is this so?


octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church

ShoulderDaemon posted:

What? No, UNIX doesn't exec programs from the cwd unless "." is a member of the PATH environment variable. It has everything to do with his PATH.

octobernight: Have your perl program print the contents of $ENV{"PATH"} to verify that it is correct; there may be some subtlety in how you are starting your perl program that is preventing it from inheriting the environment you think it is.

I had perl print $ENV{"PATH"} and it did show up correctly. I also checked, and "." is on my PATH environment variable. I'm really stumped as to whether this is a perl problem or a unix problem.
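
For reference, checking the PATH both inside Perl and inside the subshell that backticks spawn might look like this (a minimal sketch, not octobernight's actual script):

code:
#!/usr/bin/perl
use strict;
use warnings;

# What the perl process itself thinks PATH is:
print "perl sees PATH  = $ENV{PATH}\n";

# What the subshell spawned by backticks sees:
print "shell sees PATH = ", `echo \$PATH`;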

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church

ShoulderDaemon posted:

I'd check for something like an invisible unicode character or similar in the `` in your perl program. If you use something like `/full/path/to/my_program` does it work?

Also, having "." in your PATH is not normally considered a good idea.

I checked for any invisible characters and couldn't find any. I also tried using the full path, and it worked correctly. Very strange. Here's what I'm seeing:

#This works
print `~/programs/my_program/bin/my_program -help`;

#This doesn't work
print `my_program -help`;
Can't exec "my_program": No such file or directory at ./test.pl line 15.

#This works on the unix command line in any other directory
> my_program -help

I also removed the "." from my path. I only added it to test whether that could have been the problem. Thanks for all your help, though. It's just that this problem doesn't make much sense to me.

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church

ShoulderDaemon posted:

Well, when perl handles backticks it invokes a subshell; it's possible that the subshell's init process is overwriting your PATH variable.

Check the output of print `echo \$PATH` and report if that is correct.

I checked that, and it does print the correct path. Since I'm on such a tight time constraint, and I'm only verifying another project, I think I'll hardcode the path for now. I'll need to figure out why this isn't working later. Thank you for all your help, though.

Mario Incandenza posted:

FindBin (part of the core) will be useful here, if the script you're trying to run is located in the same directory as your Perl script:

I will try this later tonight once I read more about how FindBin works. Thanks for the link.
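
A sketch of how FindBin might be used here, assuming my_program sits in the same directory as the Perl script (the paths are made up):

code:
#!/usr/bin/perl
use strict;
use warnings;

# $Bin is the directory containing the currently running script.
use FindBin qw($Bin);

# Assumed layout: my_program sits next to this script.
my $cmd = "$Bin/my_program";
print `$cmd -help`;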

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church
I've built a helper.pm full of helpful functions. I'm now working on an interactive command line that allows me to easily call functions from helper.pm. However, I'm having problems getting the syntax down.

For example, if I have a defined function test which prints "Hello world", then the following would call that function:

my $function = "test";
&{$function}();

However, if I have a defined function test in helper.pm, I don't know how to call it from another file. All the following fail:

&{Helper::$function}();
&Helper::{$function}();
Helper::&{$function}();
$Helper::&{$function}();

Does anyone know the solution to this problem? Thanks!

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church

Mario Incandenza posted:

Well, calling functions indirectly and tightly coupled like that is kinda asking for trouble, but here's how to do it (even when using strict):

code:
sub foo { 'foo' }

my $method_name = 'foo';
my $ref = \&{ $method_name };

warn $ref->();
But you should really be using a dispatch table, or consider swapping to an OO-based system with proper reflection, instead of resorting to nasty hacks or manually defined mappings.

Yeah, this isn't the best solution. I'm using this for testing my in-house development tools and haven't had time to come up with a more elegant method due to other pressing issues. My main problem was that my subroutines are in a different file and I was unclear about the syntax to call them. However, I've figured it out.

code:
package foo;
sub test {
  print "Hello\n";
}
1;
code:
use foo;

my $method_name = 'test';
my $class_name = 'foo';
#Call test in foo.pm?
$class_name->$method_name();
Thanks for the reply!
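
For completeness, the dispatch-table alternative Mario mentions could look roughly like this (the package and function names are made up for illustration):

code:
package Helper;
use strict;
use warnings;

sub test  { print "Hello\n" }
sub shout { print uc(shift), "\n" }

# The table maps command names to code refs explicitly, so there is
# no symbolic lookup at call time and typos fail loudly.
my %dispatch = (
    test  => \&test,
    shout => \&shout,
);

sub run {
    my ($name, @args) = @_;
    my $code = $dispatch{$name}
        or die "Unknown command '$name'\n";
    return $code->(@args);
}

1;
An interactive prompt would then just call Helper::run($command, @args) with whatever the user typed, and an unknown name dies with a clear error instead of a symbolic-reference surprise.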

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church
I'm trying to extract names out of a tree structure and replace the labels with my own labels. The format is called newick, so this represents a full binary tree: ((A,B),(C,D)).

Here's my code to get the names and make a rename map.

code:
        
	if (@matches = $file =~ /[\(,]([^,\);]+)/smg) {
		foreach my $match (@matches) {
			my $name = trim($match);
			$name_map{$name} = $name . "_foo";
		}
	}
Here's where I get confused. The names can be substrings of each other, so if I do a global search/replace of A with A_foo, it will also mangle a longer label like AA. How do I prevent this? It feels like I'm so close since I have the part that needs to get replaced, but I have no idea how to finish the very last step. I imagine the code will be very similar to this snippet.

EDIT: Never mind, I figured it out. It doesn't feel very elegant, but I think it works:
code:
        #Requires that each label is followed by a , ) or ;
	foreach my $key (keys %name_map) {
		$file =~ s/$key([,\);])/$name_map{$key}$1/g;
	}

octobernight fucked around with this message at 05:10 on Apr 22, 2011

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church

qntm posted:

That solution does work, but I foresee problems if the existing tree already contains elements ending in "_foo", e.g. "A" and "A_foo". The existing "A" will be converted to "A_foo" on the first pass, while the existing "A_foo" will be left alone. On the second pass, you will end up with "A_foo_foo" and "A_foo_foo" respectively, which is wrong.

This might work instead:

code:
$string =~ s/([\(,])([^,\);]+)/$1$2_foo/sg;

Ah, you're correct. I didn't realize that problem. Thank you for pointing that out! I'll try your code snippet.
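
For what it's worth, a map-driven variant of that one-pass substitution might look like this (a sketch; the // defined-or operator needs Perl 5.10 or later):

code:
my $file = "((A,B),(AA,A_foo));";
my %name_map = (A => "A_new", AA => "AA_new", A_foo => "A_foo_new");

# Every label is matched exactly once, so substrings and labels that
# already carry a suffix can't be rewritten twice.
$file =~ s{([(,])([^,();]+)}{$1 . ($name_map{$2} // $2)}ge;

print "$file\n";   # ((A_new,B),(AA_new,A_foo_new));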

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church
I'm working with large 200-500 Gb text files, and I want to find lines that match a pre-defined list of line ids and return the line following each match. Here's my starting code:

code:
my %ids = ("foo",undef, "bar",undef, "baz",undef );
open(INPUT, "big.txt");
while (my $line = <INPUT>) {
	my $next = <INPUT>;
	if (defined $ids{$line}) {
		$ids{$line} = $next;
	}
}
When I run this code, iostat shows my bottleneck is CPU (roughly 80-90% of CPU time is user, only 0.10% is iowait). This makes sense because the system's bandwidth is 200-300 Gb/s, so I/O shouldn't be an issue. My machine has 16 CPUs, so I suspect that if I run 16 threads, each reading a different chunk of the data, it would greatly speed up this code. My idea would be to have each thread seek to one of 16 positions in the file and start processing from there. I haven't worked out the details (I need to figure out what happens when a thread doesn't start in exactly the right location, since the positions are just estimates), but I wanted to know what other people thought, and whether they had any alternative suggestions.

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church
Thanks for the replies all.

John Big Booty posted:

This code won't work.

1) It won't match the ids unless you chomp the line first.
2) defined $ids{foo} is always going to return false because the value of $ids{foo} is set to undef. What you want is exists $ids{foo}.

Unless the id line and the content line have fixed widths--and I imagine they do not--then threads aren't going to help. First off, you won't know how many threads to spawn because you have to read through the entire file to get a line count. Then, each thread is going to have to read lines and throw them away until they get to their starting point.

Thanks! I meant to use exists, not defined, and I left out the code that trims the lines so that whitespace is removed from both ends. I wasn't planning on starting exactly at a particular line; rather, I was going to jump to byte location XXXX for each thread. I would then throw away whatever line it read and find the first complete line that has the line-start identifier (in my file, the id format is $id). It was something off the top of my head, though.

homercles posted:

You can make threads work, certainly you can. (well I'd use Parallel::ForkManager mostly because true threads aren't needed)

Is there a way to differentiate the key lines from the value lines? Do keys match a certain pattern? If you seek to a part of the file, assume you're mid-line and read in the current line assuming it's junk, then scan in the next line, can you determine via a regex if that line is a key, or if it's a value? If so then this is the easiest case, each child can run independent of the parent.

Yes, there is. Id lines start with "$", so it's easy to tell whether a line is an id line or a value line. I'll probably try my seek strategy then.
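
A rough sketch of that seek-and-resync idea, given that id lines start with "$" (the file name and byte offset are placeholders):

code:
use strict;
use warnings;

# Jump to an arbitrary byte offset, throw away the (probably partial)
# line we landed in, then return the first complete id line.
sub first_id_line_after {
    my ($fh, $offset) = @_;
    seek($fh, $offset, 0) or die "seek failed: $!";
    <$fh> if $offset > 0;               # discard the partial line
    while (my $line = <$fh>) {
        return $line if $line =~ /^\$/; # id lines start with "$"
    }
    return;                             # no id line before EOF
}

open my $fh, '<', 'big.txt' or die "open failed: $!";
my $id_line = first_id_line_after($fh, 1_000_000);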

quote:

Alternately, maybe you can use gnu parallel for this.

A test program: perl -e 'BEGIN { $x = "x"x60 } print "key $_ $x\nval $_ $x\n" for 1 .. 214' | parallel --no-notice --block 500 --pipe -L 100 egrep --line-buffered -A1 -n "'^key .*7 '"

How you'd use it: cat BIGFILE.txt | parallel --no-notice --pipe -L 50000000 fgrep --line-buffered -A1 -x -f PATTERNSFILE.txt

You would have to parse the output to make sure matched keys are on 'odd' lines. GNU Parallel will limit the maximum number of lines sent to fgrep, forcing each new parallel block to start on a clean, guaranteed key line. GNU Parallel is also written in Perl, so it might not be much of a performance improvement though.

Thanks for this, too. I've never used GNU parallel before, but I'm reading up on it now.


Mithaldu posted:

There's one thing you need to be aware of:

Threads in Perl were created by Activestate on a contract from Microsoft to create fork emulation in Perl on Windows.

This means whenever you start a thread in Perl you actually start a hacky fork emulator, with the consequences being:

- thread startup is super expensive, start as few as you can and reuse
- thread startup duplicates the memory use of the mother thread; so start threads as early as possible, with as few things loaded as possible
- thread data is not shared by default, you need to mark every variable you wish to share
- threads are unreliable and break easily, try to use multiple processes and/or native fork first

Thanks for the advice. I did not know that threads duplicate the memory of the mother thread. I don't think this will be an issue since the mother thread shouldn't be using much memory, but I'll keep an eye on it, as well as on any unexpected behavior from threads.

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church

John Big Booty posted:

With a bit of massaging, that could be a workable approach.

Instead of having the threads jump in at a random location and discard lines, stat the file and divide the size by the number of threads. Seek to $chunksize*$i and read until you hit a $ preceded by a newline. Feed those offsets into the worker threads and you won't have to discard any lines, nor deal with the almost-guaranteed mangled data resulting therefrom.

That said, this whole setup seems a lot less than ideal. What sort of data is in these files and how are they generated? What happens to the data pulled out?

Thanks for the suggestion. I'll be testing this implementation today. I greatly simplified the data description to get at a general algorithm, but basically I'm working with next-generation sequencing data, and I'm trying to find specific reads in a file that I've identified through a different analysis. Basically, I have a huge sequencing file, and I run the whole file through a program that identifies key sequences (but only their names, not the original sequences). I then need to pull the raw reads from the file, which are fed to another program.

I've never worked with sequencing data before, nor am I very familiar with multi-threaded/multi-core programming, but I need to pick this up quickly given the amount of data I'm processing (~hundreds of TBs).
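
To make the shape of that approach concrete, a sketch combining John Big Booty's precomputed offsets with homercles's Parallel::ForkManager suggestion might look like this; the file name, worker count, and result handling are assumptions, not the thread's actual code:

code:
use strict;
use warnings;
use Parallel::ForkManager;

my $file    = 'big.txt';          # placeholder
my $workers = 16;
my $size    = -s $file;
my $chunk   = int($size / $workers);

# Byte offset of the first id line ("$"-prefixed) at or after $pos.
sub id_offset_after {
    my ($fh, $pos) = @_;
    seek($fh, $pos, 0) or die "seek failed: $!";
    <$fh> if $pos > 0;                    # skip the partial line we landed in
    while (1) {
        my $here = tell($fh);
        my $line = <$fh>;
        return $here if !defined $line;   # EOF
        return $here if $line =~ /^\$/;
    }
}

# Compute clean chunk boundaries up front, so no worker discards data.
open my $fh, '<', $file or die "open failed: $!";
my @bound = map { id_offset_after($fh, $_ * $chunk) } 0 .. $workers - 1;
push @bound, $size;
close $fh;

my $pm = Parallel::ForkManager->new($workers);
for my $i (0 .. $workers - 1) {
    $pm->start and next;                  # parent continues the loop
    open my $in, '<', $file or die "open failed: $!";
    seek($in, $bound[$i], 0) or die "seek failed: $!";
    while (tell($in) < $bound[$i + 1]) {
        my $id  = <$in>;
        last unless defined $id;
        my $val = <$in>;
        # ... look $id up in the wanted-id hash and save $val somewhere,
        #     e.g. a per-worker output file ...
    }
    $pm->finish;
}
$pm->wait_all_children;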

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church
I'm going nuts over a bizarre issue reading gzipped files with the PerlIO-gzip-0.19 module. I have a gzip file that's about 30GB in size. I read 4 lines at a time from it and print out the total number of 4-line groups read, and I get 46,285,997. However, if I decompress the file from the command line, I see that there are actually 462,843,365 4-line groups in the file. How am I off by an order of magnitude? Here's my code:

code:
open INPUT,"<:gzip",$input;

while (my $line1 = <INPUT>) {
    $counts++;
    my $line2 = <INPUT>;
    my $line3 = <INPUT>;
    my $line4 = <INPUT>;

    if ($counts % 1000000 == 0) {
      print "$counts\n";
    }
}
What's really stupid is that if I run the equivalent code in Python, it reads the file correctly. Is there something in my code that is causing unexpected behavior? It always stops in the same place ($counts == 46285997 when the loop terminates). Could the gzip file have some sort of EOF marker at that point that Python happens to "mishandle" correctly?

octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church

Mithaldu posted:

That looks like an early stop. Make sure you verify the error conditions of all the system calls you're doing after you do them.

By early stop, do you mean my script terminates incorrectly? As far as I can tell, that's not the case, because I have a print statement outside the loop showing that it completed the loop. The only other thing I can think of: is it possible that my while condition is incorrect?

code:
while (my $fline = <FORWARD>) ....
Is this an incorrect way to read a file line by line in Perl? If $fline is a blank line, will it terminate the loop? I had always thought the loop only terminates when $fline is undefined, but if a blank line can cause it to terminate, maybe that's what's happening? I just need to open the file and check whether line #185143988+ contains something that is causing it to terminate early.

I should've done this from the beginning. The error happens across different file systems, always on the same line. I was too lazy to actually open the file and look at that line to see if there's something wrong with it. If there is (i.e., a blank line throwing off the reader), that would really suck because we would need to re-verify terabytes of data.

I do recall that the person who originally generated the data had issues putting the original file together (a bunch of separate gzip files that they recombined in parallel into a single file). If they inserted some unexpected blank line in there, that's going to screw up everything.

Thanks!


octobernight
Nov 25, 2004
High Priest of the Christian-Atheist Church

Mithaldu posted:

See http://perldoc.perl.org/functions/readline.html for more detail. But basically, readline() doesn't throw exceptions. It sets an error variable. So if readline errors out, your while loop ends because it returned undef. The linked page also shows safer ways to do this.

Thank you! That's super helpful. I'll make sure to update all my scripts to follow these practices.
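
Concretely, the pattern from that perldoc page applied to the earlier gzip loop would look something like this (a sketch; the file name is a placeholder):

code:
use strict;
use warnings;
use PerlIO::gzip;

my $input = 'reads.fastq.gz';    # placeholder name
open my $in, '<:gzip', $input or die "open failed: $!";

my $counts = 0;
while (!eof($in)) {
    # readline doesn't throw; it returns undef and sets $! on error,
    # which is presumably what silently ended the earlier loop.
    my $line1 = <$in>;
    defined $line1 or die "readline failed: $!";
    my $line2 = <$in>;
    my $line3 = <$in>;
    my $line4 = <$in>;
    $counts++;
}
print "$counts\n";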
