|
Quick and dirty: code:
|
# ? Dec 1, 2008 01:11 |
|
I've got a program that keeps track of file checksums and finds duplicate files. At the moment I'm using file md5sums as a key in a hash, and the keys' value is an arrayref of absolute paths to the hashed files. That way I can go over the hash and pick out any values with more than one entry as a duplicate. The problem is if I scan the same directory twice I'll add the files twice, marking them as duplicates of themselves. It's kind of expensive to hash a file and then check if it's already in an array inside a hash, so I was wondering if there was a better way to store a one-to-many relationship in perl, one where I can look up the 'many' part efficiently. Sqlite might be a good idea but there's gotta be a simpler way.
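A minimal sketch of the hash-of-arrayrefs approach described above (the helper names are made up, not from the actual program):

```perl
use strict;
use warnings;
use Digest::MD5;

my %by_sum;    # md5 hex digest => arrayref of absolute paths

sub index_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "can't open $path: $!";
    binmode $fh;
    push @{ $by_sum{ Digest::MD5->new->addfile($fh)->hexdigest } }, $path;
    close $fh;
}

# Any digest that collected more than one path is a duplicate group.
sub duplicate_sums {
    return grep { @{ $by_sum{$_} } > 1 } keys %by_sum;
}
```

Scanning the same directory twice pushes each path into its arrayref a second time, which is exactly the self-duplicate problem described.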
|
# ? Dec 1, 2008 07:07 |
|
TiMBuS posted:I've got a program that keeps track of file checksums and finds duplicate files. At the moment I'm using file md5sums as a key in a hash, and the keys' value is an arrayref of absolute paths to the hashed files. That way I can go over the hash and pick out any values with more than one entry as a duplicate. Instead of an array, use another hash keyed by the filename, with values of '1' or something.
|
# ? Dec 1, 2008 07:19 |
|
TiMBuS posted:I've got a program that keeps track of file checksums and finds duplicate files. At the moment I'm using file md5sums as a key in a hash, and the keys' value is an arrayref of absolute paths to the hashed files. That way I can go over the hash and pick out any values with more than one entry as a duplicate. Just a random thought, as I don't know if this is useful relative to the performance on your arrayref, but could you use a hashref instead for your inner bins? The inner keys are your absolute paths. The inner values could be a tally if you cared about duplicates. Edit: Beaten.
|
# ? Dec 1, 2008 07:21 |
|
A hashref instead of an arrayref would be a touch more elegant, but it still leaves me with the problem of checksumming the file a second time (this is a bad idea). I'd much rather find if the file was already hashed and stored by checking if the filename is already in there.
|
# ? Dec 1, 2008 07:55 |
|
Maintain a separate hash of filenames you've seen thus far. Skip files you've seen.
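That "seen" filter is a one-liner per file; a sketch (paths are hypothetical, and the push stands in for the expensive checksum work):

```perl
use strict;
use warnings;

my %seen;        # absolute path => true once the file has been hashed
my @processed;   # stands in for the expensive checksumming work

# Even if the same directory is scanned twice, each path is hashed once.
for my $pass (1, 2) {
    for my $path ('/scan/a.txt', '/scan/b.txt') {
        next if $seen{$path}++;
        push @processed, $path;    # the md5-and-store work would go here
    }
}
```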
|
# ? Dec 1, 2008 07:56 |
|
Sartak posted:Maintain a separate hash of filenames you've seen thus far. Skip files you've seen. Ahh, it's probably the best idea for now. I was hoping there was some special way of treating a hash that I couldn't think up, like, I dunno, a trick using multiple keys or something.
|
# ? Dec 1, 2008 07:59 |
|
Set::Object probably has better memory and speed performance than hashes.
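For reference, the basic Set::Object operations look like this (it's a CPAN module, not core; the paths are made up):

```perl
use strict;
use warnings;
use Set::Object;    # CPAN module, not in core perl

my $seen = Set::Object->new;

$seen->insert('/scan/a.txt');    # hypothetical path

# Membership checks without carrying a dummy hash value around.
my $have_a = $seen->includes('/scan/a.txt');    # true
my $have_b = $seen->includes('/scan/b.txt');    # false
```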
|
# ? Dec 1, 2008 08:03 |
|
Huh, that's kind of odd... is there a particular reason why hashes appear to be so slow compared to this module?
|
# ? Dec 1, 2008 08:18 |
|
Sets have fewer operations. You can add or remove items, and check if an item is in the set. Hashes have to do more work, tracking a value for each item. Though hashes are optimized as gently caress in Perl, sets can easily be made even faster.
|
# ? Dec 1, 2008 08:40 |
|
Ah, no key->value pairs of course, I kinda glossed over it and didn't pick up on that. Anyway, I implemented the above by storing the filenames as keys, their hashes as values, then when I need to get the duplicates later on I swap the keys and values around into a new temporary hash and the duplicates all show up like before. Not exactly memory efficient but it works and I don't need to rehash files or loop over the entire hash. Might switch to Set::Object or sqlite later on.
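The inversion step described there can be sketched like so (the digests are made up for illustration):

```perl
use strict;
use warnings;

# filename => digest, as stored while scanning
my %sum_of = (
    '/scan/a.txt' => 'd41d8cd9',
    '/scan/b.txt' => 'd41d8cd9',
    '/scan/c.txt' => '9e107d9d',
);

# Invert into digest => arrayref of filenames; any arrayref with
# more than one entry is a group of duplicate files.
my %files_with;
push @{ $files_with{ $sum_of{$_} } }, $_ for sort keys %sum_of;

my @dupe_groups = grep { @{ $files_with{$_} } > 1 } keys %files_with;
```

As noted, this temporarily doubles the memory, but filename lookups against %sum_of stay O(1) and nothing gets rehashed.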
|
# ? Dec 1, 2008 09:58 |
|
Erasmus Darwin posted:Quick and dirty: THANK YOU!!! Adding this in worked perfectly! Man, I had actually been reading about the fork function but wasn't able to put it together and realize that was what I needed to do. I figured since a forked process is a 'child' process the parent process would still wait for it to close before it finished. Still not sure exactly how it's working, but it is working! Guess I've got more reading to do, but since I have a working example it should make it easier to understand.
|
# ? Dec 1, 2008 10:49 |
|
NeoHentaiMaster posted:I figured since a forked process is a 'child' process the parent process would still wait for it to close before it finished.

That only happens if you call wait, which, as the name implies, makes the parent process wait until a child terminates. wait is also useful for cleaning up zombie processes -- those are child processes that have already completed but which haven't been cleaned up via wait yet. That's not really an issue here since the child process doesn't finish until after script1 is done, but it's worth knowing about for other situations.

quote:Still not sure exactly how its working but it is working! Guess I got more reading to do but since I have a working example it should make it easier to understand.

The thing to understand is what happens when you call fork. It creates an exact duplicate of your current process -- same code, copy of all the data, etc. The only difference is the return value from fork, namely 0 for the child, and the pid of the child for the parent. Also, since the child process is a copy, any changes to variables won't be seen by the parent and vice versa. So you've essentially got two copies of script1 running once you call fork. One's going to just pipe a few lines to script2 and then exit, and it doesn't matter that it's going to spend a while twiddling its thumbs while waiting on script2. The other goes on to finish the regular script1 stuff. It's also worth noting that if you leave off the "exit;" for the child process, it'll continue on executing the parent's code, making a funny mess of things since all your post-fork code will get run twice by two different processes. Also, here's something fun to consider: code:
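The fork/exit/wait skeleton described above looks roughly like this (the child's real work is elided):

```perl
use strict;
use warnings;

my $pid = fork;
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: an exact copy of the parent from this point on.
    # The real work (e.g. piping lines to script2) would happen here.
    exit 0;    # without this, the child would fall through and run
               # the parent-only code below a second time
}

# Parent: $pid holds the child's process id (the child itself saw 0).
waitpid $pid, 0;             # reap the child so it never lingers as a zombie
my $child_status = $? >> 8;  # the child's exit code
```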
|
# ? Dec 1, 2008 15:48 |
|
Just a small question here: Is there a command that will execute another perl script? To make it clearer, I have three scripts I want to run one after the other. Is there a way to do that using perl?
|
# ? Dec 3, 2008 01:03 |
|
See perldoc -f system. I will not recommend the use of do EXPRESSION here (or anywhere ever).
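A sketch of running scripts in sequence with list-form system (the -e one-liners here stand in for the real script filenames, which weren't given):

```perl
use strict;
use warnings;

# List-form system bypasses the shell; $^X is the perl running this script.
my @jobs = (
    [ '-e', 'exit 0' ],    # stand-in for the first script
    [ '-e', 'exit 0' ],    # stand-in for the second script
);

my $ran = 0;
for my $job (@jobs) {
    system($^X, @$job) == 0 or die "job failed: $?";
    $ran++;
}
```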
|
# ? Dec 3, 2008 01:05 |
|
I have to recommend IPC::System::Simple's systemx and capturex, because they do everything I can't be bothered to think about.
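For the curious, the basic shape of those two (IPC::System::Simple is a CPAN module; the -e one-liners stand in for real commands):

```perl
use strict;
use warnings;
use IPC::System::Simple qw(systemx capturex);    # CPAN module

# systemx and capturex never touch the shell and die on failure,
# so there's no manual $? bookkeeping. $^X is the running perl binary.
systemx($^X, '-e', 'exit 0');
my $out = capturex($^X, '-e', 'print "ok\n"');
```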
|
# ? Dec 4, 2008 01:31 |
|
Quick question (maybe): I'm writing a script that reads a big list of words and prints out the number of words that don't have an a, # of ones that don't have an e, etc. I have the file I/O stuff working fine, but when I go to sort them, it sorts them incorrectly. Here's what I have: code:
19154 words do not have an 'u'
14858 words do not have an 'o'
13715 words do not have an 'i'
11953 words do not have an 'a'
10311 words do not have an 'e'
|
# ? Dec 6, 2008 01:11 |
|
fyreblaize posted:sort{$a <=> $b}@x; cmp compares two strings, which is what you want. Because the default comparator for sort is cmp, just sort @x is all you need. Edit: Oh, it's not really clear what you're actually trying to do. You're assigning the hash %results to the list @x, which flattens the hash structure. Maybe sort { $results{$b} <=> $results{$a} } keys %results? Filburt Shellbach fucked around with this message at 03:38 on Dec 6, 2008 |
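Putting those two suggestions together (the tallies here just reuse the numbers from the output above):

```perl
use strict;
use warnings;

# Counts from the output above, as a hash of letter => tally.
my %results = (u => 19154, o => 14858, i => 13715, a => 11953, e => 10311);

# Sort the *keys* by their values, descending. sort returns a new
# list; it never sorts in place.
my @by_count = sort { $results{$b} <=> $results{$a} } keys %results;

printf "%d words do not have an '%s'\n", $results{$_}, $_ for @by_count;
```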
# ? Dec 6, 2008 03:35 |
|
Along with what Sartak said, sort doesn't sort in-place either, so sort{$a <=> $b}@x; does nothing. It should be @x = sort{$a <=> $b} @x; but that's probably not the entire solution to your problem.
|
# ? Dec 6, 2008 03:51 |
|
aside: sort in a void context is optimized out, and never executes. I had a friend who would do sort {$a+=$b} @a
|
# ? Dec 7, 2008 13:01 |
|
Best practices for global configuration data? I'm leaning towards handing out a JSON struct via REST, as it's a fair bit quicker than reading a file off disk and doesn't go stale as easily. I'm starting to feel the pain of redundant configuration files.
|
# ? Dec 8, 2008 08:20 |
|
SpeedFrog posted:Best practices for global configuration data? I'm leaning towards handing out a JSON struct via REST, as it's a fair bit quicker than reading a file off disk and doesn't go stale as easily. Am starting to feel the pain of redundant configuration files. Global configuration data in what context? What needs configuration? (Perl Questions -> JSON -> REST )
|
# ? Dec 8, 2008 19:44 |
|
Customer-facing and staff-only Catalyst apps, scripts that get run in crontab, and daemon processes, spread across geographically separate machines. For example, we have a cluster of RPC servers that talk to SMPP gateways, domain registries, that sort of thing, and it would be nice to have some sort of common method of specifying service details from both sides (e.g. from Catalyst when sending an SMS, or from the RPC server when processing a delivery receipt for the same). I could just as easily use YAML, Storable or DDS, but then I don't get the benefits of native browser support.
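One possible shape for the fetch side, sketched with modern core modules (HTTP::Tiny and JSON::PP; the URL and config keys are entirely hypothetical):

```perl
use strict;
use warnings;
use HTTP::Tiny;                   # core since perl 5.14
use JSON::PP qw(decode_json);     # core since perl 5.14

# Parsing is split out so it can be exercised without a network.
sub parse_config {
    my ($json) = @_;
    return decode_json($json);
}

# Each app pulls one shared config from a central endpoint instead of
# keeping its own copy on disk.
sub fetch_config {
    my ($url) = @_;
    my $res = HTTP::Tiny->new->get($url);
    die "config fetch failed: status $res->{status}" unless $res->{success};
    return parse_config($res->{content});
}
```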
|
# ? Dec 9, 2008 02:15 |
|
Higher Order Perl is available online: http://hop.perl.plover.com/book/
|
# ? Dec 10, 2008 00:05 |
|
For god's sakes read it!
|
# ? Dec 10, 2008 03:38 |
|
I figured everyone here's been a Good Boy (tm) and already has their own, physical copy of HOP.
|
# ? Dec 10, 2008 04:23 |
|
Currently working on a web app that uses sqlite and does some heavy crunching on market order data grabbed from a third party. Thanks to NYTProf I managed to cut its runtime by a factor of 400, so I can only chime in on the love for it here.
|
# ? Dec 14, 2008 12:24 |
|
Figures that three hours after the previous post I stumble upon an actual question: I am trying to do a large number of inserts (30 chunks with ~200_000 inserts each) into an SQLite database and am looking for methods to speed that up, since doing single inserts for each is pitifully slow. So far I've got this: code:
code:
code:
|
# ? Dec 14, 2008 15:22 |
|
My suggestion would be to rip away all the "other" code and just concentrate on getting the statement handle to run with just one bind param. You never know if you mistyped something or something is magically expanding. I guess you just have to trust that the SQLite driver is "correct". That, or it's broken?
|
# ? Dec 14, 2008 15:50 |
|
The reason why I'm suspecting it's an SQLite limitation is that it just plain cannot do things like this: INSERT INTO table (one, two) VALUES (1, 2), (2, 3). Gonna try restricting it though and see what happens. Edit: Now I'm confused... DBD::SQLite::st execute_array failed: 1 bind values supplied but 8 expected. Edit2: Ok, looked at the documentation again. I'm a retard. I thought the binds bound data sets (rows), but they instead bind columns. Mithaldu fucked around with this message at 16:00 on Dec 14, 2008 |
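For anyone else hitting that error, column-wise binding looks like this (an in-memory database and made-up table stand in for the real one; requires DBD::SQLite):

```perl
use strict;
use warnings;
use DBI;    # with DBD::SQLite installed

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE orders (one INTEGER, two INTEGER)');

# execute_array takes one array per *column* (placeholder), not per row.
my $sth  = $dbh->prepare('INSERT INTO orders (one, two) VALUES (?, ?)');
my @ones = (1, 2, 3);
my @twos = (10, 20, 30);

$dbh->begin_work;
my $tuples = $sth->execute_array({ ArrayTupleStatus => \my @status },
                                 \@ones, \@twos);
$dbh->commit;
```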
# ? Dec 14, 2008 15:53 |
|
Mithaldu posted:I am trying to do a large number of inserts (30 chunks with ~200_000 inserts each) into an SQLite database and am looking for methods to speed that up since doing single inserts for each is pitifully slow. Try wrapping chunks of inserts in a transaction (ie: BEGIN, INSERT x 10000, COMMIT). Also check out PRAGMA synchronous if you haven't already.
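Both suggestions in one sketch (in-memory database and table are made up; requires DBD::SQLite, and PRAGMA synchronous = OFF trades crash-durability for speed, which is fine for a database you can rebuild):

```perl
use strict;
use warnings;
use DBI;    # with DBD::SQLite installed

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE t (v INTEGER)');

# Skip the fsync on every write.
$dbh->do('PRAGMA synchronous = OFF');

# One transaction around the whole chunk, instead of SQLite's default
# of one implicit transaction (and one disk sync) per INSERT.
my $sth = $dbh->prepare('INSERT INTO t (v) VALUES (?)');
$dbh->begin_work;
$sth->execute($_) for 1 .. 1000;
$dbh->commit;
```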
|
# ? Dec 16, 2008 16:32 |
|
Did both of these already. Also, just tried splitting it into chunks of 10000 and it added on 30% more processing time. That's how it looks right now, and I don't think there's anything more that can be done. code:
|
# ? Dec 16, 2008 18:23 |
|
There's way too much going on there. Have you tried doing just the inserts? No conversions, no error logging, no file opening, no looping. Try preprocessing the file and then inserting that, and see how it goes. Gosh, your code is all over the place. Does the SQLite driver not implement $dbh->begin_work and $dbh->commit? Edit for mantra: Isolate. Isolate. Isolate.
|
# ? Dec 16, 2008 18:40 |
|
Mithaldu posted:Did both of these already. I can't speak to SQLite specifically but in general commits are a very expensive operation. Assuming that SQLite supports transactions you should disable AutoCommit and issue a commit manually at the end of the transaction.
|
# ? Dec 16, 2008 19:23 |
|
As far as SQLite is concerned, these are all exactly the same thing:
• AutoCommit=0 / commit
• BEGIN / END
• BEGIN / COMMIT
• begin_work / commit

I've implemented all of these before and the results and benchmarks are all identical. The only reason I am using BEGIN/END is that that's what the SQLite manual talked about, and it makes for good visual markers in the code. However, I'd like to hear about why begin_work / commit is better.

Triple Tech posted:There's way too much going on there.

Triple Tech posted:Gosh, your code is all over the place.

I realize I didn't spend much time making it look pretty, but I'd like to hear what more experienced people have to say anyway.
|
# ? Dec 16, 2008 19:59 |
|
Mithaldu posted:I've implemented all of these before and the results and benchmarks are all identical. Only reason i am using BEGIN/END is that that's what the SQLite manual talked about and it makes for good visual markers in the code.

Code should be written from both a semantic and aesthetic perspective. Design-wise, you should be telling the driver what to do, not having the driver just pass on you talking to the database directly. If the driver abstracts the concept of starting and stopping transactions, and that abstraction has no implementation penalty, then you should use it. For visual markers in your code, just use space, comments, and an editor with syntax highlighting.

Mithaldu posted:You mean as far as pinpointing the performance problems goes? Haven't taken it apart yet, because i'm pretty confident that i know which parts impact how much. However, I've also run it through NYTProf and the result seems pretty straight-forward: http://drop.io/3gtix2e

I don't have much (any?) experience with NYTProf, but isolating parts of your code means 1) better designed code, and 2) being even MORE sure that what you're looking at is the right thing to look at. I mean, how many times have programmers said "I'm totally sure this is the cause of the problem" only to be corrected by something else a few seconds later? Even if you are right, it doesn't hurt. Your code needs like a fashion makeover, but that's just me and how I would go about diagnosing a problem like this. Because with clean code, I can be more sure that altering the non-execute parts of it will not help improve total runtime. With yours, I'm not so sure. And even if NYTProf proves it, I don't know how to read it.
|
# ? Dec 16, 2008 20:18 |
|
Triple Tech posted:Code should be written from both a semantic and aesthetic perspective. Design-wise, you should be telling the driver what to do, not having the driver just pass on you talking to the database directly. If the driver abstracts the concept of starting and stopping transactions, and that abstraction has no implementation penalty, then you should use it.

Triple Tech posted:For visual markers in your code, just use space, comments, and an editor with syntax highlighting.

Triple Tech posted:NYTProf?

Open the file index.html in any browser. Keep in mind it is a line-based profiler. The top table tells you how much realtime it spent on what subroutine call in what module. The table below lists it by module. DumpGetter.pm is the main module I'm working on. If you click on any report in the lower table, it shows you the time spent in the relevant things in a new page.

Triple Tech posted:Isolation.
|
# ? Dec 16, 2008 20:46 |
|
Mithaldu posted:You mean as far as pinpointing the performance problems goes? Haven't taken it apart yet, because i'm pretty confident that i know which parts impact how much. However, I've also run it through NYTProf and the result seems pretty straight-forward: http://drop.io/3gtix2e

According to your profiling, your DELETE statements are eating up way more time than the INSERTs. The INSERTs seem pretty quick at 0.00014 seconds per query, although I'm not that familiar with SQLite performance characteristics. If you take the DELETEs out of the equation, you are spending more time on CSV-related tasks than you are in DBI.
|
# ? Dec 16, 2008 21:18 |
|
satest4 posted:According to your profiling, your DELETE statements are eating up way more time than the INSERTs.

drat, you're right, I completely missed that. Guess I really should try parsing the files manually.

satest4 posted:INSERTs seem pretty quick at 0.00014 seconds per query
|
# ? Dec 16, 2008 21:30 |
|
Hmm, how to explain isolation... It's just the clear and easy separation of concepts. Your code is doing many, many things, and the task at hand was to examine the performance of executing a particular SQL statement. So, the best way to do that is to design your code in such a way that it's easy to enable/disable/shuffle around the different parts, so that when we want to focus on something, the other stuff isn't bogging it down. I'm sure the more experienced goons can explain what I'm talking about, but they usually don't read this particular thread. To illustrate, your code is doing a lot:
• Database prep
• Logging
• Error handling
• File manipulation
• File parsing
• Looping
• Translation
• Insertion
• Deletion
See how much stuff that is? Those are all large, separate concepts that each have their own problems. By designing them in such a way that they aren't as... intertwined as your code, it makes debugging easier. Bleh, gotta get back to work...
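A toy illustration of that separation, with each concern in its own sub so any one can be stubbed out or benchmarked alone (names and logic are purely illustrative, not from the real code):

```perl
use strict;
use warnings;

# One sub per concern: parsing, conversion, insertion.
sub parse_line  { my ($line) = @_; return [ split /,/, $line ] }
sub convert_row { my ($row)  = @_; return [ map { $_ * 2 } @$row ] }
sub insert_row  { my ($store, $row) = @_; push @$store, $row }

my @fake_db;    # stands in for a real database handle
for my $line ("1,2", "3,4") {
    insert_row(\@fake_db, convert_row(parse_line($line)));
}
```

With this shape, swapping insert_row for a no-op instantly tells you how much time the database itself is responsible for.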
|
# ? Dec 16, 2008 22:04 |