Erasmus Darwin
Mar 6, 2001
Quick and dirty:

code:
$pid = fork();
if (!defined $pid) {       # fork() returns undef on failure, not a negative pid
  die('Fork failed.');
} elsif ($pid == 0) {
  # Child process:
  open(HANDLE, '| /bin/script2') or die('Pipe to script2 failed.');
  print HANDLE $stuff;     # $stuff = whatever you need to feed script2
  close(HANDLE);
  exit;
}

# Parent process continues on...

TiMBuS
Sep 25, 2007

LOL WUT?

I've got a program that keeps track of file checksums and finds duplicate files. At the moment I'm using file md5sums as keys in a hash, and each key's value is an arrayref of absolute paths to the hashed files. That way I can go over the hash and pick out any value with more than one entry as a set of duplicates.
The problem is that if I scan the same directory twice I'll add the files twice, marking them as duplicates of themselves. It's kind of expensive to hash a file and then check whether it's already in an array inside a hash, so I was wondering if there's a better way to store a one-to-many relationship in Perl, one where I can look up the 'many' part efficiently.
Sqlite might be a good idea but there's gotta be a simpler way.

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

TiMBuS posted:

I've got a program that keeps track of file checksums and finds duplicate files. At the moment I'm using file md5sums as keys in a hash, and each key's value is an arrayref of absolute paths to the hashed files. That way I can go over the hash and pick out any value with more than one entry as a set of duplicates.
The problem is that if I scan the same directory twice I'll add the files twice, marking them as duplicates of themselves. It's kind of expensive to hash a file and then check whether it's already in an array inside a hash, so I was wondering if there's a better way to store a one-to-many relationship in Perl, one where I can look up the 'many' part efficiently.
Sqlite might be a good idea but there's gotta be a simpler way.

Instead of an array, use another hash keyed by the filename, with values of '1' or something.
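Rough sketch of what I mean (the remember() sub and the variable names are just made up for illustration); re-adding the same path then becomes a no-op:

code:
my %by_md5;   # md5 => { absolute path => 1 }

sub remember {
    my ($md5, $path) = @_;
    $by_md5{$md5}{$path} = 1;   # autovivifies the inner hash; same path twice is harmless
}

# later: any md5 with more than one path is a set of duplicates
for my $checksum (keys %by_md5) {
    my @paths = keys %{ $by_md5{$checksum} };
    print "duplicates: @paths\n" if @paths > 1;
}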

Rapportus
Oct 31, 2004
The 4th Blue Man

TiMBuS posted:

I've got a program that keeps track of file checksums and finds duplicate files. At the moment I'm using file md5sums as keys in a hash, and each key's value is an arrayref of absolute paths to the hashed files. That way I can go over the hash and pick out any value with more than one entry as a set of duplicates.
The problem is that if I scan the same directory twice I'll add the files twice, marking them as duplicates of themselves. It's kind of expensive to hash a file and then check whether it's already in an array inside a hash, so I was wondering if there's a better way to store a one-to-many relationship in Perl, one where I can look up the 'many' part efficiently.
Sqlite might be a good idea but there's gotta be a simpler way.

Just a random thought, as I don't know how this compares performance-wise to your arrayref, but could you use a hashref instead for your inner bins? The inner keys are your absolute paths. The inner values could be a tally if you cared about counting duplicates.

Edit: Beaten.

TiMBuS
Sep 25, 2007

LOL WUT?

A hashref instead of an arrayref would be a touch more elegant, but it still leaves me with the problem of checksumming the file a second time (which is a bad idea).
I'd much rather find out whether the file was already hashed and stored by checking if the filename is already in there.

Filburt Shellbach
Nov 6, 2007

Apni tackat say tujay aaj mitta juu gaa!
Maintain a separate hash of filenames you've seen thus far. Skip files you've seen.
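Something like this, roughly (untested sketch; the file list and the checksum step are stand-ins for whatever you already have):

code:
my %seen;
my @files_to_scan = @ARGV;               # stand-in for your directory walk
for my $file (@files_to_scan) {
    next if $seen{$file}++;              # second sighting: skip without re-hashing
    # ... md5 the file and store it as before ...
}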

TiMBuS
Sep 25, 2007

LOL WUT?

Sartak posted:

Maintain a separate hash of filenames you've seen thus far. Skip files you've seen.
but but MY MEMORY

Ahh, it's probably the best idea for now. I was hoping there was some special way of treating a hash that I couldn't think up, like, I dunno, a trick using multiple keys or something.

Filburt Shellbach
Nov 6, 2007

Apni tackat say tujay aaj mitta juu gaa!
Set::Object probably has better memory and speed performance than hashes.
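Usage is roughly this (sketch only, not benchmarked; the loop is a stand-in for your scan):

code:
use Set::Object;

my $seen = Set::Object->new;
for my $path (@ARGV) {                   # stand-in for your directory walk
    next if $seen->includes($path);
    $seen->insert($path);
    # ... checksum and record the file ...
}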

TiMBuS
Sep 25, 2007

LOL WUT?

Huh, that's kind of odd... is there a particular reason why hashes appear to be so slow compared to this module?

Filburt Shellbach
Nov 6, 2007

Apni tackat say tujay aaj mitta juu gaa!
Sets have fewer operations. You can add or remove items, and check if an item is in the set.

Hashes have to do more work, tracking a value for each item.

Though hashes are optimized as gently caress in Perl, sets can easily be made even faster.

TiMBuS
Sep 25, 2007

LOL WUT?

Ah, no key->value pairs of course, I kinda glossed over it and didn't pick up on that.

Anyway, I implemented the above by storing the filenames as keys and their checksums as values; when I need the duplicates later on, I swap the keys and values around into a new temporary hash and the duplicates all show up like before. Not exactly memory efficient, but it works and I don't need to re-hash files or loop over the entire hash.
Might switch to Set::Object or sqlite later on.
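For the record, the inversion step looks roughly like this (untested sketch):

code:
my %md5_of;        # path => md5, filled in during the scan

my %paths_by_md5;  # built only when it's time to report
while ( my ($path, $md5) = each %md5_of ) {
    push @{ $paths_by_md5{$md5} }, $path;
}

for my $md5 (keys %paths_by_md5) {
    my @dupes = @{ $paths_by_md5{$md5} };
    print "$md5 is shared by: @dupes\n" if @dupes > 1;
}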

NeoHentaiMaster
Jul 13, 2004
More well adjusted then you'd think.

Erasmus Darwin posted:

Quick and dirty:

code:
$pid = fork();
if (!defined $pid) {       # fork() returns undef on failure, not a negative pid
  die('Fork failed.');
} elsif ($pid == 0) {
  # Child process:
  open(HANDLE, '| /bin/script2') or die('Pipe to script2 failed.');
  print HANDLE $stuff;     # $stuff = whatever you need to feed script2
  close(HANDLE);
  exit;
}

# Parent process continues on...

THANK YOU!!! Adding this in worked perfectly! Man, I had actually been reading about the fork function but wasn't able to put it together and realize that's what I needed to do. I figured since a forked process is a 'child' process, the parent process would still wait for it to close before it finished. Still not sure exactly how it's working, but it is working! Guess I've got more reading to do, but since I have a working example it should make it easier to understand.

Erasmus Darwin
Mar 6, 2001

NeoHentaiMaster posted:

I figured since a forked process is a 'child' process, the parent process would still wait for it to close before it finished.

That only happens if you call wait, which, as the name implies, makes the parent process wait until a child terminates. wait is also useful for cleaning up zombie processes -- child processes that have already completed but haven't been reaped via wait yet. That's not really an issue here since the child process doesn't finish until after script1 is done, but it's worth knowing about for other situations.
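If you ever do need it, the reaping end is just this (sketch, using the $pid from the snippet above):

code:
# block until that specific child exits, then reap it:
waitpid($pid, 0);

# or reap any finished children without blocking (handy in a loop or a SIGCHLD handler):
use POSIX ':sys_wait_h';
1 while waitpid(-1, WNOHANG) > 0;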

quote:

Still not sure exactly how it's working, but it is working! Guess I've got more reading to do, but since I have a working example it should make it easier to understand.

The thing to understand is what happens when you call fork. It creates an exact duplicate of your current process -- same code, copy of all the data, etc. The only difference is the return value from fork, namely 0 for the child, and the pid of the child for the parent. Also, since the child process is a copy, any changes to variables won't be seen by the parent and vice-versa.

So you've essentially got two copies of script1 running once you call the fork. One's going to just pipe a few lines to script2 and then exit, and it doesn't matter that it's going to spend a while twiddling its thumbs while waiting on script2. The other goes on to finish the regular script1 stuff. It's also worth noting that if you leave off the "exit;" for the child process, it'll continue on executing the parent's code making a funny mess of things since all your post-fork code will get run twice by two different processes.

Also, here's something fun to consider:
code:
perl -e 'for ($i=0; $i<3; ++$i) { fork(); } print ".";'
This will output 8 periods instead of 4. The reason is that the for loop continues to execute in the child processes, so you have children spawning children.

Kraus
Jan 17, 2008
Just a small question here:

Is there a command that will execute another perl script?

To make what I'm saying more clear, I have three scripts I want run one after the other. Is there a way to do that using perl?

Filburt Shellbach
Nov 6, 2007

Apni tackat say tujay aaj mitta juu gaa!
See perldoc -f system.
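For three scripts back to back, that's roughly (sketch; the script names are made up):

code:
for my $script ('first.pl', 'second.pl', 'third.pl') {
    system($^X, $script) == 0               # $^X is the perl that's running this script
        or die "$script exited with status $?";
}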

I will not recommend the use of do EXPRESSION here (or anywhere ever).

Nevergirls
Jul 4, 2004

It's not right living this way, not letting others know what's true and what's false.
I have to recommend IPC::System::Simple's systemx and capturex, because they do everything I can't be bothered to think about.
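Roughly like so (sketch; script names made up) -- no shell involved, and both die with a useful message on failure:

code:
use IPC::System::Simple qw(systemx capturex);

systemx($^X, 'first.pl');                 # like system(), but error-checked for you
my $output = capturex($^X, 'second.pl');  # like backticks, but safer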

fyreblaize
Nov 2, 2006

Your language is silly
Quick question (maybe):

I'm writing a script that reads a big list of words and prints out the number of words that don't have an 'a', the number that don't have an 'e', etc. I have the file I/O stuff working fine, but when I go to sort the results, they come out in the wrong order. Here's what I have:

code:
...
# $as is the counter for a's, $es is for e's, etc


my %results = ("\nWords without A's = " => $as, "\nWords without E's = " => $es,  
"\nWords without I's = " => $is, "\nWords without O's = " => $os, "\nWords without U's = " => $us);

@x = %results;

sort{$a <=> $b}@x;

print @x;
for reference:
19154 words do not have an 'u'
14858 words do not have an 'o'
13715 words do not have an 'i'
11953 words do not have an 'a'
10311 words do not have an 'e'

Filburt Shellbach
Nov 6, 2007

Apni tackat say tujay aaj mitta juu gaa!

fyreblaize posted:

sort{$a <=> $b}@x;

<=> compares two numbers. Your strings are not numeric, so each is interpreted as 0 for the purposes of the sort.

cmp compares two strings, which is what you want. Because the default comparator for sort is cmp, just sort @x is all you need.


edit: Oh, it's not really clear what you're actually trying to do. You're assigning the hash %results to the list @x, which flattens the hash structure. Maybe sort { $results{$b} <=> $results{$a} } keys %results?
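A tidier shape for the data, while I'm at it, is a hash keyed by the letter itself (sketch, assuming the $as/$es/... counters from your snippet); sorting the keys by their values then gives the descending list you posted for reference:

code:
my %without = (a => $as, e => $es, i => $is, o => $os, u => $us);

for my $letter (sort { $without{$b} <=> $without{$a} } keys %without) {
    printf "%d words do not have an '%s'\n", $without{$letter}, $letter;
}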

Filburt Shellbach fucked around with this message at 03:38 on Dec 6, 2008

TiMBuS
Sep 25, 2007

LOL WUT?

Along with what Sartak said, sort doesn't sort in-place either.
So sort{$a <=> $b}@x; does nothing.

Should be @x = sort{$a <=> $b} @x; but that's probably not the entire solution to your problem...

tef
May 30, 2004

-> some l-system crap ->
aside:

sort in a void context is optimized out, and never executes.

I had a friend who would do sort {$a+=$b} @a

Mario Incandenza
Aug 24, 2000

Tell me, small fry, have you ever heard of the golden Triumph Forks?
Best practices for global configuration data? I'm leaning towards handing out a JSON struct via REST, as it's a fair bit quicker than reading a file off disk and doesn't go stale as easily. Am starting to feel the pain of redundant configuration files.

fansipans
Nov 20, 2005

Internets. Serious Business.

SpeedFrog posted:

Best practices for global configuration data? I'm leaning towards handing out a JSON struct via REST, as it's a fair bit quicker than reading a file off disk and doesn't go stale as easily. Am starting to feel the pain of redundant configuration files.

Global configuration data in what context? What needs configuration? (Perl Questions -> JSON -> REST :raise:)

Mario Incandenza
Aug 24, 2000

Tell me, small fry, have you ever heard of the golden Triumph Forks?
Customer-facing and staff-only Catalyst apps, scripts that get run in crontab, and daemon processes, spread across geographically separate machines. For example, we have a cluster of RPC servers that talk to SMPP gateways, domain registries, that sort of thing, and it would be nice to have some sort of common method of specifying service details from both sides (e.g. from Catalyst when sending an SMS, or from the RPC server when processing a delivery receipt for the same).

I could just as easily use YAML, Storable or DDS, but then I don't get the benefits of native browser support.
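For what it's worth, the consuming side would be tiny (sketch only; the URL is made up):

code:
use LWP::Simple qw(get);
use JSON qw(decode_json);

my $json = get('http://config.example.internal/global.json')
    or die 'could not fetch config';
my $config = decode_json($json);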

tef
May 30, 2004

-> some l-system crap ->
:toot: http://hop.perl.plover.com/book/ :toot:

Higher order perl is available online.

Filburt Shellbach
Nov 6, 2007

Apni tackat say tujay aaj mitta juu gaa!
For god's sakes read it!

Triple Tech
Jul 28, 2006

So what, are you quitting to join Homo Explosion?
I figured everyone here's been a Good Boy (tm) and already has their own, physical copy of HOP.

Mithaldu
Sep 25, 2007

Let's cuddle. :3:
Currently working on a web app that uses sqlite and does some heavy number crunching on market order data grabbed from a third party. Thanks to NYTProf i managed to cut its runtime by a factor of 400, so i can only chime in on the love for it here. :D

Mithaldu
Sep 25, 2007

Let's cuddle. :3:
Figures that three hours after the previous post i stumble upon an actual question:

I am trying to do a large number of inserts (30 chunks with ~200_000 inserts each) into an SQLite database and am looking for methods to speed that up since doing single inserts for each is pitifully slow.

So far i got this:
code:
my $csv_fh = Tie::Handle::CSV->new(
    csv_parser => Text::CSV_XS->new( {allow_whitespace => 1} ),
    file => $unpackedname,
    simple_reads => 1
);

$dbh->do("BEGIN;");

my $sth = $dbh->prepare(
    "REPLACE INTO orders ( orderid, volume, price ) "
    . " VALUES(?,?,?);"
);

while (my $order = <$csv_fh>) {
    $sth->execute(
        $order->{orderid}, $order->{volremain}, $order->{price}
    );
}
close $csv_fh;

$dbh->do("END;");
Now i was hoping to make it a bit faster by using execute_array. The code for that looks as follows and matches the example given in the DBI documentation:

code:
my $csv_fh = Tie::Handle::CSV->new(
    csv_parser => Text::CSV_XS->new( {allow_whitespace => 1} ),
    file => $unpackedname,
    simple_reads => 1
);

$dbh->do("BEGIN;");

my $sth = $dbh->prepare(
    "REPLACE INTO orders ( orderid, volume, price ) "
    . " VALUES(?,?,?);"
);

my $insert_count = 0;
while (my $order = <$csv_fh>) {
    $sth->bind_param_array(
        ++$insert_count, 
        [ $order->{orderid}, $order->{volremain}, $order->{price} ]
    );
}
close $csv_fh;

$sth->execute_array( { ArrayTupleStatus => \my @tuple_status } );
    
$dbh->do("END;");
However that creates this error message:
code:
DBD::SQLite::st execute_array failed: 84490 bind values supplied but 3 expected
Did i do something wrong or is this a limitation of SQLite? Does someone maybe have any suggestions for other ways to speed this up?

Triple Tech
Jul 28, 2006

So what, are you quitting to join Homo Explosion?
My suggestion would be to rip away all the "other" code and just concentrate on getting the statement handle to run with just one bind param. You never know if you mistyped something or if something is magically expanding. I guess you just have to take it that the SQLite driver is "correct". That or it's broken?

Mithaldu
Sep 25, 2007

Let's cuddle. :3:
The reason i'm suspecting it's an SQLite limitation is that it just plain cannot do things like this:

INSERT into table (one,two) VALUES (1,2),(2,3)

Gonna try restricting it though and see what happens.

Edit: Now i'm confused... DBD::SQLite::st execute_array failed: 1 bind values supplied but 8 expected

Edit2: Ok, looked at the documentation again. I'm a retard. I thought the binds bound whole data sets (rows), but they actually bind columns.
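In other words, the column-wise version would look roughly like this (untested sketch):

code:
my (@orderids, @volumes, @prices);
while (my $order = <$csv_fh>) {
    push @orderids, $order->{orderid};
    push @volumes,  $order->{volremain};
    push @prices,   $order->{price};
}
close $csv_fh;

$sth->bind_param_array(1, \@orderids);   # one call per placeholder (column), not per row
$sth->bind_param_array(2, \@volumes);
$sth->bind_param_array(3, \@prices);
$sth->execute_array( { ArrayTupleStatus => \my @tuple_status } );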

Mithaldu fucked around with this message at 16:00 on Dec 14, 2008

s139252
Jan 1, 1970
test

Mithaldu posted:

I am trying to do a large number of inserts (30 chunks with ~200_000 inserts each) into an SQLite database and am looking for methods to speed that up since doing single inserts for each is pitifully slow.

Try wrapping chunks of inserts in a transaction (ie: BEGIN, INSERT x 10000, COMMIT). Also check out PRAGMA synchronous if you haven't already.
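Something along these lines (sketch; the 10_000 chunk size is arbitrary, tune to taste):

code:
my $rows = 0;
$dbh->begin_work;
while (my $order = <$csv_fh>) {
    $sth->execute( $order->{orderid}, $order->{volremain}, $order->{price} );
    unless (++$rows % 10_000) {          # commit the chunk and start a new transaction
        $dbh->commit;
        $dbh->begin_work;
    }
}
$dbh->commit;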

Mithaldu
Sep 25, 2007

Let's cuddle. :3:
Did both of these already. :)

Also, just tried splitting it into chunks of 10000 and it added on 30% more processing time.

That's how it looks right now, and i don't think there's anything more that can be done.
code:
$dbh = DBI->connect( $cfg->param("dbi.source"), '', '', { AutoCommit => 1 } ) or make_error( $cfg->param("errors.sqlconn") );
$dbh->do("PRAGMA synchronous = OFF;") or make_error( $cfg->param("errors.sqlfail") . '<br>' . $dbh->errstr );
$dbh->do("PRAGMA temp_store = MEMORY;") or make_error( $cfg->param("errors.sqlfail") . '<br>' . $dbh->errstr );
$dbh->do("PRAGMA journal_mode = OFF;") or make_error( $cfg->param("errors.sqlfail") . '<br>' . $dbh->errstr );

use Tie::Handle::CSV;
my $csv_fh = Tie::Handle::CSV->new( csv_parser => Text::CSV_XS->new( {allow_whitespace => 1} ), file => $unpackedname ,simple_reads => 1);

$dbh->do("BEGIN;") or make_error( $cfg->param("errors.sqlfail") . '<br>' . $dbh->errstr );

my $sth = prepare_table_update();

my $order_count = 0;

while (my $order = <$csv_fh>) {
    convert_entry_dates($order);
    $order_count++;
    next if $order->{expiry} < $expiretime;
    $sth->execute(
        $order->{orderid}, $order->{typeid}, $order->{volremain}, $order->{price},
        int $order->{expiry}, $order->{bid}, $order->{systemid}, $order->{regionid}
    ) or make_error( $cfg->param("errors.sqlfail") . '<br>' . $dbh->errstr );
    write_status("Downloading $queue_size files...<br> Preparing insert of $entry->{filename}...<br>$order_count/$max_orders orders...")
        if ($order_count % 500) == 0;
}
close $csv_fh;

write_status("Downloading $queue_size files...<br>Committing $max_orders orders to database...");
$dbh->do("END;") or make_error( $cfg->param("errors.sqlfail") . '<br>' . $dbh->errstr );

Triple Tech
Jul 28, 2006

So what, are you quitting to join Homo Explosion?
There's way too much going on there. Have you tried doing just the inserts? No conversions, no error logging, no file opening, no looping. Try preprocessing the file and then insert that and see how that goes.

Gosh, your code is all over the place. :( Does the SQLite driver not implement $dbh->begin_work and $dbh->commit?

Edit for mantra: Isolate. Isolate. Isolate.

var1ety
Jul 26, 2004

Mithaldu posted:

Did both of these already. :)

Also, just tried splitting it into chunks of 10000 and it added on 30% more processing time.

That's how it looks right now, and i don't think there's anything more that can be done.
code:
$dbh = DBI->connect( $cfg->param("dbi.source"), '', '', { AutoCommit => 1 } ) or make_error( $cfg->param("errors.sqlconn") );

I can't speak to SQLite specifically but in general commits are a very expensive operation. Assuming that SQLite supports transactions you should disable AutoCommit and issue a commit manually at the end of the transaction.

Mithaldu
Sep 25, 2007

Let's cuddle. :3:
as far as SQLite is concerned, these are all exactly the same thing:

AutoCommit=0 / commit
BEGIN / END
BEGIN / COMMIT
begin_work / commit

I've implemented all of these before and the results and benchmarks are all identical. Only reason i am using BEGIN/END is that that's what the SQLite manual talked about and it makes for good visual markers in the code.

However i'd like to hear about why begin_work / commit is better.


Triple Tech posted:

There's way too much going on there.
You mean as far as pinpointing the performance problems goes? Haven't taken it apart yet, because i'm pretty confident that i know which parts impact how much. However, I've also run it through NYTProf and the result seems pretty straight-forward: http://drop.io/3gtix2e


Triple Tech posted:

Gosh, your code is all over the place. :(

Edit for mantra: Isolate. Isolate. Isolate.
Mind explaining what you mean?

I realize i didn't spend much time making it look pretty, but i'd like to hear what more experienced people have to say anyway. :)

Triple Tech
Jul 28, 2006

So what, are you quitting to join Homo Explosion?

Mithaldu posted:

I've implemented all of these before and the results and benchmarks are all identical. Only reason i am using BEGIN/END is that that's what the SQLite manual talked about and it makes for good visual markers in the code.

Code should be written from both a semantic and aesthetic perspective. Design-wise, you should be telling the driver what to do, not having the driver just pass along raw SQL while you talk to the database directly. If the driver abstracts the concept of starting and stopping transactions, and that abstraction has no implementation penalty, then you should use it.
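Concretely, something like this (sketch), with the bonus that you get a natural place to roll back on errors:

code:
$dbh->begin_work;                        # instead of $dbh->do("BEGIN;")
eval {
    while (my $order = <$csv_fh>) {
        $sth->execute( $order->{orderid}, $order->{volremain}, $order->{price} );
    }
    $dbh->commit;                        # instead of $dbh->do("END;")
    1;
} or do {
    my $err = $@ || 'unknown error';
    $dbh->rollback;
    die "bulk insert failed: $err";
};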

For visual markers in your code, just use space, comments, and an editor with syntax highlighting.

Mithaldu posted:

You mean as far as pinpointing the performance problems goes? Haven't taken it apart yet, because i'm pretty confident that i know which parts impact how much. However, I've also run it through NYTProf and the result seems pretty straight-forward: http://drop.io/3gtix2e
Mind explaining what you mean?

I don't have much (any?) experience with NYTProf, but isolating parts of your code means 1) better designed code 2) being even MORE sure that what you're looking at is the right thing to look at. I mean, how many times have programmers said "I'm totally sure this is the cause of the problem" only to be corrected by something else a few seconds later. Even if you are right, it doesn't hurt. Your code needs, like, a fashion makeover, but that's just me and how I would go about diagnosing a problem like this.

Because with clean code, I can be more sure that altering the non-execute parts of it will not help improve total runtime. With yours, I'm not so sure. And even if NYTProf proves it, I don't know how to read it. :shobon:

Mithaldu
Sep 25, 2007

Let's cuddle. :3:

Triple Tech posted:

Code should be written from both a semantic and aesthetic perspective. Design-wise, you should be telling the driver what to do, not having the driver just pass on you talking to the database directly. If the driver abstracts the concept of starting and stopping transactions, and that abstraction has no implementation penalty, then you should use it.
Alright, thanks for that. You're completely right and i should stop thinking of the driver as only being something that allows me to talk with the database.

Triple Tech posted:

For visual markers in your code, just use space, comments, and an editor with syntax highlighting.
I do, Komodo IDE is awesome. :) I guess it's some kind of commentary on how ugly this code is when i rely on the SQL statements to guide my eye. I do have to say that i usually write much more readable code and that this was pretty much frankensteined together.

Triple Tech posted:

NYTProf? :shobon:
It's dead-easy, really. :)

Open the file index.html in any browser. Keep in mind it is a line-based profiler. The top table tells you how much real time it spent on what subroutine call in what module. The table below lists it by module. DumpGetter.pm is the main module i'm working on. If you click on any report in the lower table, it shows you the time spent in the relevant things on a new page.

Triple Tech posted:

Isolation.
That's one thing i just can't quite parse at the moment. I am aware of how to write more readable code, but i am completely self-taught, so i like to hear from more experienced people in plain english what their vocabulary means. Right now all i can think of is: "Split that stuff up into more subroutines." (Which i am incidentally slightly allergic to, what with the massive overhead of subroutine calls in Perl.)

s139252
Jan 1, 1970
test

Mithaldu posted:

You mean as far as pinpointing the performance problems goes? Haven't taken it apart yet, because i'm pretty confident that i know which parts impact how much. However, I've also run it through NYTProf and the result seems pretty straight-forward: http://drop.io/3gtix2e
Mind explaining what you mean?

According to your profiling, your DELETE statements are eating up way more time than the INSERTs. INSERTs seem pretty quick at 0.00014 seconds per query, although I'm not that familiar with SQLite performance characteristics.

If you take the DELETEs out of the equation, you are spending more time on CSV-related tasks than you are in DBI.

Mithaldu
Sep 25, 2007

Let's cuddle. :3:

satest4 posted:

According to your profiling, your DELETE statements are eating up way more time than the INSERTs.

If you take the DELETEs out of the equation, you are spending more time on CSV-related tasks than you are in DBI.

drat, you're right, I completely missed that. Guess i really should try parsing the files manually.

satest4 posted:

INSERTs seem pretty quick at 0.00014 seconds per query
Thanks a lot for that! The SQLite benchmarks claim that they're faster than MySQL, but the unavailability of multi-row inserts made me doubt that. Knowing that the inserts actually compare favourably already brings me a good step ahead.

Triple Tech
Jul 28, 2006

So what, are you quitting to join Homo Explosion?
Hmm, how to explain isolation... It's just the clear and easy separation of concepts. Your code is doing many, many things, and the task at hand was to examine the performance of executing a particular SQL statement. So the best way to do that is to design your code in such a way that it's easy to enable/disable/shuffle around the different parts, so that when we want to focus on something, the other stuff isn't bogging it down. I'm sure the more experienced goons can explain what I'm talking about, but they usually don't read this particular thread.

To illustrate, your code is doing a lot:

• Database prep
• Logging
• Error handling
• File manipulation
• File parsing
• Looping
• Translation
• Insertion
• Deletion

See how much stuff that is? Those are all large, separate concepts that each have their own problems. By designing them in such a way that they aren't as... intertwined as they are in your code, debugging gets easier.
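To make it concrete, the kind of stripped-down test I mean looks something like this (sketch; synthetic rows, an already-connected $dbh, and nothing else in the loop):

code:
use Time::HiRes qw(time);

my $sth = $dbh->prepare("REPLACE INTO orders (orderid, volume, price) VALUES (?,?,?)");

my $start = time;
$dbh->begin_work;
$sth->execute( $_, 100, 9.99 ) for 1 .. 200_000;   # no CSV, no logging, no date conversion
$dbh->commit;
printf "inserts only: %.2fs\n", time - $start;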

Bleh, gotta get back to work...
