|
FreshFeesh posted:
I'm trying to iterate through an alphabetized list and edit/remove entries based on particular criteria, but as I scale up the code I want to make sure I'm being as clean as possible, since I haven't touched this stuff in a while.

Do you actually care about whether these things are in sequence? If not, you can just split the string on a space, use the first value as a key, and dump things into a hash of arrayrefs.
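A minimal sketch of that hash-of-arrayrefs idea (the entry strings here are invented for illustration; the real data presumably looks different):

```perl
use strict;
use warnings;

# Hypothetical entries standing in for the poster's alphabetized list.
my @entries = (
    'Alice likes tea',
    'Alice likes scones',
    'Bob likes coffee',
);

my %by_key;
for my $entry (@entries) {
    # Split on the first space only: first field is the key, rest is the value.
    my ($key, $rest) = split / /, $entry, 2;
    push @{ $by_key{$key} }, $rest;
}

# %by_key is now:
#   Alice => ['likes tea', 'likes scones'],
#   Bob   => ['likes coffee'],
```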
|
# ? Jul 24, 2016 12:29 |
|
|
# ? Apr 25, 2024 18:29 |
|
I considered that, but unfortunately my data is more akin to:
code:
FreshFeesh fucked around with this message at 13:09 on Jul 24, 2016 |
# ? Jul 24, 2016 13:04 |
|
Seriously, do fix those var names; they're the main thing making this hard to read.

Also, you can condense this:

code:
elsif ($newout[$i] =~ m/^Favorite Drink:/ or $newout[$i] =~ m/^Favorite Food:/ or $newout[$i] =~ m/^Something Else:/ or $newout[$i] =~ m/^Dance Party:/) {
to:

code:
elsif ($newout[$i] =~ m/^(Favorite Drink|Favorite Food|Something Else|Dance Party):/) {
Also, add a next; after each operation where you consider the line concluded, and you can get rid of a lot of the control structure.

code:
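As a sketch of the advice above (the array contents are invented; @newout is the name from the post), the condensed alternation plus next; flattens the elsif cascade into independent checks:

```perl
use strict;
use warnings;

# Made-up sample lines; the real file's contents aren't shown in the thread.
my @newout = (
    'Favorite Drink: coffee',
    'Something Else: whatever',
    'Unrelated line',
);

my @matched;
for my $i (0 .. $#newout) {
    # One alternation instead of four or-ed matches.
    if ($newout[$i] =~ m/^(Favorite Drink|Favorite Food|Something Else|Dance Party):/) {
        push @matched, $newout[$i];
        next;    # line handled; skip everything below instead of chaining elsifs
    }
    # ...each further elsif becomes a plain if ending in next;...
}

print scalar(@matched), " lines matched\n";    # 2 lines matched
```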
|
# ? Jul 24, 2016 13:23 |
|
Thank you for the suggestions, they're certainly helping and things look a lot better.
|
# ? Jul 24, 2016 13:38 |
|
I'm working with large (200-500 GB) text files, and I want to find lines that match a pre-defined list of line IDs and return the line following each match. Here's my starting code:

code:
|
# ? Aug 1, 2016 20:48 |
|
octobernight posted:
I'm working with large 200-500 Gb text files, and I want to find lines that match a pre-defined list line ids and return the next line following the match. Here's my starting code:

1) It won't match the ids unless you chomp the line first.

2) defined $ids{foo} is always going to return false, because the value of $ids{foo} is set to undef. What you want is exists $ids{foo}.

Unless the id line and the content line have fixed widths--and I imagine they do not--then threads aren't going to help. First off, you won't know how many threads to spawn, because you have to read through the entire file to get a line count. Then each thread is going to have to read lines and throw them away until it gets to its starting point.
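The exists-vs-defined (and chomp) points in a runnable nutshell, with invented ids:

```perl
use strict;
use warnings;

# Keys loaded with undef values, as in the id lookup being discussed.
my %ids = map { $_ => undef } qw(id1 id2 id3);

# defined checks the *value*, which is undef here, so it's always false:
print defined $ids{id1} ? "defined\n" : "not defined\n";   # not defined

# exists checks whether the *key* is present, which is what we want:
print exists $ids{id1} ? "exists\n" : "missing\n";         # exists

# And the chomp point: a line read from a file keeps its newline,
# so "id1\n" is not the same hash key as "id1".
my $line = "id1\n";
chomp $line;
print exists $ids{$line} ? "found\n" : "not found\n";      # found
```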
|
# ? Aug 2, 2016 03:44 |
|
You can make threads work, certainly. (Well, I'd use Parallel::ForkManager, mostly because true threads aren't needed.)

Is there a way to differentiate the key lines from the value lines? Do keys match a certain pattern? If you seek to a part of the file, assume you're mid-line and read in the current line assuming it's junk, then scan in the next line, can you determine via a regex whether that line is a key or a value? If so, this is the easiest case: each child can run independent of the parent.

Otherwise, have each thread assume that one of the following is true: either the first line in its slice is a key and the next is a value, OR the first line is a value and subsequent lines form key/value pairs. When done, send both results to the parent, along with the number of lines processed. Once the parent has all processed results from its children, it can reconstruct whether the first line each slice processed was a key or a value (based on the number of scanned-in lines each child sent) and choose the processed results accordingly. It might be space prohibitive and it's going to be a bit pesky to write, but it's conceptually simple.

Alternately, maybe you can use GNU parallel for this. A test program:

code:
perl -e 'BEGIN { $x = "x"x60 } print "key $_ $x\nval $_ $x\n" for 1 .. 214' | parallel --no-notice --block 500 --pipe -L 100 egrep --line-buffered -A1 -n "'^key .*7 '"
How you'd use it:

code:
cat BIGFILE.txt | parallel --no-notice --pipe -L 50000000 fgrep --line-buffered -A1 -x -f PATTERNSFILE.txt
You would have to parse the output to make sure matched keys are on 'odd' lines; GNU Parallel will limit the maximum number of lines sent to fgrep, forcing each new parallel block to start on a clean, guaranteed key line. GNU Parallel is also written in Perl, so it might not be much of a performance improvement, though.
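The "one child per slice, each reporting its result plus a line count back to the parent" shape can be sketched with core fork and pipes (Parallel::ForkManager wraps this bookkeeping for you; the slices and sums here are invented stand-ins for file chunks):

```perl
use strict;
use warnings;

my @slices = ([1 .. 5], [6 .. 10]);    # stand-ins for slices of the file
my @readers;

for my $slice (@slices) {
    pipe(my $r, my $w) or die "pipe: $!";
    my $pid = fork;
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {                   # child: process the slice
        close $r;
        my $sum = 0;
        $sum += $_ for @$slice;
        # Send the result plus the number of lines processed to the parent.
        print {$w} "$sum ", scalar(@$slice), "\n";
        close $w;
        exit 0;
    }
    close $w;                          # parent keeps only the read end
    push @readers, $r;
}

my @totals;
for my $r (@readers) {
    my ($sum, $lines) = split ' ', scalar <$r>;
    push @totals, $sum;
    close $r;
}
wait for @readers;                     # reap both children
print "@totals\n";                     # 15 40
```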
|
# ? Aug 2, 2016 09:05 |
|
octobernight posted:
threads

Threads in Perl were created by ActiveState on a contract from Microsoft to create fork emulation in Perl on Windows. This means whenever you start a thread in Perl you actually start a hacky fork emulator, with the consequences being:

- thread startup is super expensive, so start as few as you can and reuse them
- thread startup duplicates the memory use of the mother thread, so start threads as early as possible, with as few things loaded as possible
- thread data is not shared by default; you need to mark every variable you wish to share
- threads are unreliable and break easily; try to use multiple processes and/or native fork first
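The "mark every variable you wish to share" point looks like this in practice (assumes a perl built with thread support):

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $counter : shared = 0;    # without :shared, each thread gets its own copy

my @threads = map {
    threads->create(sub {
        lock($counter);      # shared data still needs explicit locking
        $counter++;
    });
} 1 .. 4;

$_->join for @threads;
print "counter = $counter\n";    # counter = 4
```

Without the :shared attribute, each of the four fork-emulated threads would increment its own private copy and the parent would still see 0.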
|
# ? Aug 2, 2016 10:55 |
|
Thanks for the replies, all.

John Big Booty posted:
This code won't work.

Thanks! I meant to use exists, not defined, and I left out the code that trims the lines so that all whitespace is removed from the ends. I wasn't planning on starting exactly at a particular line; rather, I was going to jump to byte location XXXX for each thread. I would then throw away whatever line it read and find the first complete line that has the line-start identifier (in my file, the id format is $id). It was something off the top of my head, though.

homercles posted:
You can make threads work, certainly you can. (well I'd use Parallel::ForkManager mostly because true threads aren't needed)

Yes, there is. Id lines start with "$", so it's easy to figure out whether a line is an id line or a value line. I'll probably try my seek strategy then.

quote:
Alternately, maybe you can use gnu parallel for this.

Thanks for this, too. I've never used GNU parallel before, but I'm reading over it now.

Mithaldu posted:
There's one thing you need to be aware of:

Thanks for the advice. I did not know that threads duplicate the memory of the mother thread. I don't think this will be an issue, since the mother thread shouldn't be using much memory, but I will keep an eye on it, as well as on any unexpected behavior from threads.
|
# ? Aug 2, 2016 18:57 |
|
octobernight posted:
Thanks! I meant to use exists, not defined, and I left out the code that trims the lines so that all whitespace is removed from the ends. I wasn't planning on start exactly at a particular line, rather I was going to jump to XXXX byte location for each thread. I would then throw away whatever line it read it and find the first complete line that has the line start identifier (in my file, the id format is $id). It was something off the top of my head, though.

Instead of having the threads jump in at a random location and discard lines, stat the file and divide the size by the number of threads. Seek to $chunksize*$i and read until you hit a $ preceded by a newline. Feed those offsets into the worker threads and you won't have to discard any lines, or risk the almost-guaranteed mangled data that results from doing so.

That said, this whole setup seems a lot less than ideal. What sort of data is in these files, and how are they generated? What happens to the data pulled out?
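A sketch of that offset-splitting idea, demonstrated on a throwaway temp file of $id/value pairs (the id-on-every-other-line layout is assumed from the thread; adjust the /^\$/ test to the real format):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stat the file, divide its size by the worker count, then nudge each
# boundary forward to the next line beginning with '$' (the id marker),
# so every worker starts on a whole record.
sub chunk_offsets {
    my ($path, $workers) = @_;
    my $size = -s $path;
    open my $fh, '<', $path or die "open $path: $!";
    my @offsets = (0);
    for my $i (1 .. $workers - 1) {
        seek $fh, int($size * $i / $workers), 0 or die "seek: $!";
        scalar <$fh>;                  # assume we landed mid-line; discard it
        while (1) {
            my $pos  = tell $fh;
            my $line = <$fh>;
            last unless defined $line;
            if ($line =~ /^\$/) {      # found the next clean id line
                push @offsets, $pos;
                last;
            }
        }
    }
    close $fh;
    return @offsets;
}

# Demo with a small fake file of id/value pairs.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "\$id$_\nvalue$_\n" for 1 .. 100;
close $tmp;

my @offsets = chunk_offsets($path, 4);
print "@offsets\n";
```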
|
# ? Aug 3, 2016 01:40 |
|
John Big Booty posted:
With a bit of massaging, that could be a workable approach.

Thanks for the suggestion. I'll be testing this implementation today.

I greatly simplified the data description to get at a general algorithm, but basically I'm working with next-generation sequencing data, and I'm trying to find specific reads in a file that I've identified through a different analysis. Basically, I had a huge sequencing file, and I ran the whole file through a program that identifies key sequences (but reports only their names, not the original sequences). Thus, I need to pull the raw reads from the file, which are then fed to another program.

I've never worked with sequencing data before, nor am I too familiar with multi-threaded/multi-core programming, but it's a requirement that I pick this up quickly, due to the amount of data I'm processing (~hundreds of TBs).
|
# ? Aug 3, 2016 18:37 |
|
Is there a good guide for changes from 5.14 to 5.24? Googling hasn't been helpful and my copy of Programming Perl might never be updated again so I want to know what I'm doing wrong.
|
# ? Nov 5, 2016 03:31 |
|
perldelta is the be-all and end-all of what changed: http://perldoc.perl.org/index-history.html

As for the specifics, I tend not to delve into them much personally; I still have to work in environments stuck on 5.10.0 (not that I'm complaining, it still feels quite modern, dating from before smartmatch was butchered). Perhaps a kinder soul (Mithaldu?) could delve into the big things that changed.
|
# ? Nov 5, 2016 06:17 |
|
To be honest, I'm not entirely sure much world-shifting stuff happened?

- Postfix deref is amazing, if you know you'll only be running on new perls.
- Autoderef is dead and gone.

Other than that, I'm not really aware of much that I can actually use.
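For anyone who hasn't run into it yet, postfix deref on 5.24+ looks like this:

```perl
use strict;
use warnings;
use v5.24;    # postfix deref is stable from 5.24 onward

my $aref = [1, 2, 3];
my $href = { a => 1, b => 2 };

my @list = $aref->@*;            # same as @{ $aref }, reads left to right
my @keys = sort keys $href->%*;  # same as keys %{ $href }
my $last = $aref->$#*;           # same as $#{ $aref }

print "@list | @keys | $last\n";   # 1 2 3 | a b | 2
```

The win is that deref chains read left to right instead of inside out, which matters once you're several levels deep in a structure.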
|
# ? Nov 6, 2016 00:12 |
|
Mithaldu posted:
To be honest, i'm not entirely sure much world-shifting stuff happened?

Subroutine signatures are pretty nice, too. And there's also the double diamond operator, which is worth knowing about, I suppose.
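Both in brief (signatures were still experimental in this era, hence the warnings pragma; the greet sub is invented for illustration):

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

# Named parameters, with an optional default, instead of unpacking @_.
sub greet ($name, $greeting = 'Hello') {
    return "$greeting, $name!";
}

print greet('Perl'), "\n";          # Hello, Perl!
print greet('Perl', 'Hi'), "\n";    # Hi, Perl!

# The double diamond, new in 5.22, reads @ARGV files like <> but uses
# three-arg open internally, so a filename like "|rm -rf /" is treated
# as a literal name rather than executed:
#
#     while (my $line = <<>>) { ... }
```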
|
# ? Nov 9, 2016 10:40 |
|
I'm going nuts over the most bizarre issue reading gzipped files with the PerlIO-gzip-0.19 module. I have a gzip file that's about 30GB in size. I read 4 lines at a time from it and print out the total number of 4-line groups read, and I get 46,285,997. However, if I decompress the file from the command line, I see that there are actually 462,843,365 4-line groups in the file. How am I off by an order of magnitude? Here's my code:

code:
|
# ? Feb 7, 2017 17:25 |
|
That looks like an early stop. Make sure you verify the error conditions of all the system calls you're doing after you do them.
|
# ? Feb 7, 2017 17:28 |
|
Mithaldu posted:
That looks like an early stop. Make sure you verify the error conditions of all the system calls you're doing after you do them.

By early stop, do you mean my script terminates incorrectly? As far as I can tell, that's not the case, because I have a print statement outside the loop that shows it completed the loop. The only other thing I can think of is: is it possible that my while condition is incorrect?

code:
I should've done this from the beginning. The error happens across different file systems, always on the same line. I was too lazy to actually open the file and look at that line to see if there's something wrong with it. If there is (i.e., a blank line throwing off the reader), that would really suck, because we would need to re-verify terabytes of data. I do recall that the person who originally generated the data had issues putting the original file together (a bunch of separate gzip files that they recombined in parallel into a single file). If they inserted some unexpected blank line in there, that's going to screw up everything. Thanks!
|
# ? Feb 7, 2017 18:31 |
|
See http://perldoc.perl.org/functions/readline.html for more detail. But basically, readline() doesn't throw exceptions; it sets an error variable. So if readline errors out, your while loop ends because it returned undef, exactly as it does at end of file. The linked page also shows safer ways to do this.
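A sketch of that advice: distinguish "hit EOF" from "readline failed" after the loop. Reading this script itself ($0) keeps the example self-contained; for the gzip case you'd open with PerlIO::gzip instead, e.g. open my $fh, '<:gzip', $file.

```perl
use strict;
use warnings;
use IO::Handle;    # for the error() method on filehandles

open my $fh, '<', $0 or die "open $0: $!";

my $count = 0;
while (defined(my $line = <$fh>)) {
    $count++;
}

# The loop above ends on *any* undef from readline, so check why it ended:
if ($fh->error) {
    die "read error on $0: $!";    # readline failed mid-file
}
die "stopped before end of file" unless eof $fh;

print "read $count lines cleanly\n";
close $fh or die "close: $!";
```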
|
# ? Feb 7, 2017 18:39 |
|
Mithaldu posted:
See http://perldoc.perl.org/functions/readline.html for more detail. But basically, readline() doesn't throw exceptions. It sets an error variable. So if readline errors out your while loop ends because it returned something undef. The linked article also shows ways to do this safer.

Thank you! That's super helpful. I'll make sure to update all my scripts to follow these practices.
|
# ? Feb 7, 2017 18:44 |
|
Cheers and good luck with that. For what it's worth: Modules like autodie and IO::All also hide a lot of this "need to do it to do it safely" behind convenient automation.
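autodie (in core since 5.10.1) in a nutshell: lexically replaces the built-ins so failures die with a useful message instead of needing an `or die` after every call.

```perl
use strict;
use warnings;
use autodie;    # open/close/etc. now throw on failure in this scope

# Without autodie this would need: open(...) or die "can't open: $!";
open my $fh, '<', $0;    # dies with file name and reason if this fails
my @lines = <$fh>;
close $fh;               # close failures are caught too

print scalar(@lines), " lines\n";
```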
|
# ? Feb 7, 2017 23:08 |
|
The hospital I work for is trying to be more resource efficient and reuse or recycle our office furniture. I learned today that they don't have any sort of inventory management system for the used furniture that they receive, or keep track of what they give back out or recycle. They also don't have a way for users to browse through what used furniture is available to be redeployed.

That got me thinking of a possible project. I have no expectations of actually deploying it, but it seems like it would be great for learning. I'm thinking I'd learn a little bit about databases to catalog the furniture, and web frameworks to allow users to see and choose items.

From a design perspective, when tackling a little larger project like this, where do I start? Should I start with picking a database to use and build that up, then go from there? Do I start with learning a web framework like Dancer2 or Mojolicious? I really want to move beyond writing simple scripts that parse CSVs and the like, but I always stall out in the design phase of a larger program, and I don't know anyone IRL who programs that I can bounce ideas off of.
|
# ? Feb 18, 2017 01:33 |
|
As far as databases go, unless you're getting a ready-made database structure provided, just use Postgres. Everything else is either poo poo, insane, or too expensive. And if the framework you end up with doesn't provide an ORM, use DBIx::Class.

As far as frameworks go: Dancer2 if you expect the project to be very small even 10 years down the road; Mojolicious if you don't mind it being experimental bleeding edge and breaking easily, or if you expect trouble managing dependencies (there should be none if you just use carton/local::lib); or just go with Catalyst, because it scales well, is stable, and is well documented.
|
# ? Feb 18, 2017 10:14 |
|
|
|
Mithaldu posted:
As far as databases go, unless you're getting a ready-made database structure provided, just use Postgres. Everything else is either poo poo, insane, or too expensive. And if the framework you end up with doesn't provide an ORM, use DBIx::Class.

Thanks for the information, I'll see what I can get into!
|
# ? Feb 19, 2017 00:11 |