uG
Apr 23, 2003

by Ralp

homercles posted:

What happens if you declare your methods static? You're polluting the global namespace for no reason.
Nothing. If there were a namespace conflict, I expect the compiler would tell me.

I'm considering just saying gently caress it and using Perl datatypes. Is there a significant speed penalty in doing so?

uG fucked around with this message at 06:32 on Nov 3, 2012

Aquarium of Lies
Feb 5, 2005

sad cutie
:justtrans:

she/her
Taco Defender

uG posted:

I'm having some XS troubles. First, let me present the code:
http://pastebin.com/SeMGSuA8

If you strip out the cxs_edistance function and the Perl headers, it will compile and return the correct result. This comes as a surprise to me because my Perl tests have been failing and I thought my C was bad.

As you can see, I hard-coded the same values in main as I did at the end of cxs_edistance. This was based on a guess that the data was getting passed in wrong from my wrapper function, but it proved that the problem lies elsewhere.

The error is a segmentation fault, and it happens the first time it hits line 58. I don't mean to turn this into a 'debug my C code' post inside the Perl thread, especially since the C code appears to work on its own. I'm just wondering where I should look next, since I've apparently been tweaking C code for no reason.

perl Makefile.PL and make test always fail (it never gives a reason; I believe the testing module craps out entirely on XS segfaults?), and a forced install (of this code as a module) always results in a segfault for whatever script calls the exported function.

I've never worked with XS so I'm just shooting in the dark here, but it looks like you're going past the end of your arrays in the two "setup scoring matrix" loops.

uG
Apr 23, 2003

by Ralp
It is not the loops. This can be demonstrated by the if statement inside, which is where it's segfaulting on its first iteration. Removing the if statement (and just letting the code block inside it run every time) results in the code working as intended.

So why don't I just take them out? I could, but I 'want' the (struct dictionary) values to be unique. What happens when we take out those if statements is that (struct dictionary) gets stuffed with duplicate values (keys), but later, when I change a specific value, it always iterates to the first occurrence and sets/gets 'its' value. I'm left with a bunch of junk we never use (which seems sloppy), so I'm not going to just leave it at that.

Namespace was a pretty good guess, since a conflicting namespace could potentially only screw it up when compiling with the Perl headers.

What I can say is it's directly related to the linked list (item* head) in the if statement I mentioned above. Removing the Perl headers, the XS prototypes at the bottom, and the cxs_edistance function results in code that, when compiled, returns the expected value (so the C guys I know think I'm crazy).

For what it's worth, here is the pure Perl version of the above:
https://github.com/ugexe/Text--Levenshtein--Damerau/blob/master/lib/Text/Levenshtein/Damerau/PP.pm
It's not complex; it's just an edit distance between strings. The difference between this and the XS above is that we're working with ints instead of chars (to handle different character widths), as demonstrated by the XS wrapper in the .pm:
code:
sub xs_edistance {
    # Wrapper for XS cxs_edistance function
    my $str1 = shift;
    my $str2 = shift;
    my @arr1 = unpack 'U*', $str1;
    my @arr2 = unpack 'U*', $str2;

    return cxs_edistance( \@arr1, \@arr2 );
}

uG fucked around with this message at 18:13 on Nov 3, 2012

homercles
Feb 14, 2010

uG posted:

What I can say is it's directly related to the linked list (item* head) in the if statement I mentioned above. Removing the Perl headers, the XS prototypes at the bottom, and the cxs_edistance function results in code that, when compiled, returns the expected value (so the C guys I know think I'm crazy).
code:
 39   item *head,*curr,*iterator;
 40   head = (item*)malloc(sizeof(item));
 41   curr = head;
...
 58     if(hash(head,src[i]) == NULL){
The contents of head haven't been initialised, causing head->next->next to segfault via hash().

woon socket
Sep 30, 2011

by XyloJW

homercles posted:

code:
 39   item *head,*curr,*iterator;
 40   head = (item*)malloc(sizeof(item));
 41   curr = head;
...
 58     if(hash(head,src[i]) == NULL){
The contents of head haven't been initialised, causing head->next->next to segfault via hash().

There is no call to head->next->next:

First call to hash (line 23):
code:
  iterator = head;  //iterator = { next =  , value = , count = }
  while(iterator->next){  // { untrue }
    ..
  } // skipped to here because while is untrue
  return NULL;

homercles
Feb 14, 2010

tonski posted:

There is no call to head->next->next:

First call to hash (line 23):
code:
  iterator = head;  //iterator = { next =  , value = , count = }
  while(iterator->next){  // { untrue }
    ..
  } // skipped to here because while is untrue
  return NULL;
I'll use the line numbers from pastebin.

40: head is malloc'd. The contents of head contain garbage, as it has not been memset, so: head = { next = <garbage>, value = <garbage>, count = <garbage> }
58: hash(head, src[i]) is called. head's contents still have not been initialised
23: item* iterator = head;
24: while(iterator->next){ // the truthfulness of this statement is undefined. it might be true, might be false. it depends on the memory allocated by malloc. we will assume it's true for this example as that will cause a segfault
25: if(iterator->value == index){ // undefined. assume false for this example
28: iterator = iterator->next // that is, iterator = head->next. head->next contains garbage. iterator is now filled with nonaddressable garbage.
24: while(iterator->next){ // this may segfault: we're testing head->next->next, which may fail since head->next was never initialised with a value, and attempting to deref it is undefined behavior

homercles fucked around with this message at 21:32 on Nov 3, 2012

uG
Apr 23, 2003

by Ralp
Alas, that is not the problem either. FWIW, you can compile this and it will spit out '1': http://pastebin.com/M86yLumM

Same code with the Perl headers slapped on, exporting/calling main() (no arguments, the values are hard-coded) from the XS.pm wrapper (instead of xs_edistance), and we segfault in the same spot we've been discussing (which, again, works perfectly fine outside the Perlish environment).

Polygynous
Dec 13, 2006
welp
That's still assuming the Perl version isn't doing something different with malloc, which I'm not sure you can assume. (From my brief googling of "xs perl malloc", which mostly just hurt my head, especially one post recommending checking everything, including that the return value of malloc isn't NULL...)

homercles
Feb 14, 2010

uG posted:

Alas, that is not the problem either. FWIW, you can compile this and it will spit out '1': http://pastebin.com/M86yLumM

Same code with the Perl headers slapped on, exporting/calling main() (no arguments, the values are hard-coded) from the XS.pm wrapper (instead of xs_edistance), and we segfault in the same spot we've been discussing (which, again, works perfectly fine outside the Perlish environment).

The malloc stuff is a problem. It's not the problem, but it's a problem. I ran your code on my machine and it prints out 3, not 1, because you've got array corruption too. On my machine, writing to scores[ax+1][ay+1] was changing the value of ay.

Here's my version that works and prints 1: http://pastebin.com/uN0dp9DT

I changed how push and hash work, and removed item *curr,*iterator from scores. I also changed the scores array to int scores[ax+2][ay+2], because you're reading and writing past its bounds.

Declaring an array int x[1] and then reading/writing to x[1] is broken, so the array has to be 1 larger. The same goes for declaring int scores[ax+1][ay+1] and then reading/writing to scores[ax+1][ay+1].

uG
Apr 23, 2003

by Ralp
You, sir, have saved me from a week-long downward spiral into madness. I can now happily stay the hell away from C for a little while :)

Crush
Jan 18, 2004
jot bought me this account, I now have to suck him off.
I am trying to save user input as a variable in a Perl script. I already have this part down. The part I need help with is when the input has characters that require escaping. I have a feeling that substitution would help here, but I don't know how to implement it against the variable. How do I handle this?

This is how I currently have it acquiring the data:
code:
print "Question: ";
my $variable = <>;
chomp( $variable );


I'd prefer to not call in modules and just use pure Perl for this if possible. Please help out this Perl n00b!

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

Crush posted:

I am trying to save user input as a variable in a Perl script. I already have this part down. The part I need help with is when the input has characters that require escaping.

How you escape data has absolutely nothing to do with how you acquire a variable, and everything to do with how you intend to use it. Why do you think your input needs escaping? What are you actually trying to do?

Crush
Jan 18, 2004
jot bought me this account, I now have to suck him off.

ShoulderDaemon posted:

How you escape data has absolutely nothing to do with how you acquire a variable, and everything to do with how you intend to use it. Why do you think your input needs escaping? What are you actually trying to do?

More specifically, I am trying to have the input go into the body of an HTML page. I believe it needs escaping because when I paste something that has parentheses, brackets, etc., it errors out, but when I just have alphanumeric characters, it doesn't.

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

Crush posted:

More specifically, I am trying to have the input go into the body of an HTML page. I believe it needs escaping because when I paste something that has parentheses, brackets, etc., it errors out, but when I just have alphanumeric characters, it doesn't.

HTML does need escaping, but neither parentheses nor brackets should cause a problem. It'd be helpful if you posted the actual error, rather than just saying "it errors out".

To escape for HTML, it's probably easiest to use:
Perl code:
use CGI qw( escapeHTML );

my $escapedString = escapeHTML( $string );
CGI is a core module, so it should be present on every installation of Perl.

Notorious b.s.d.
Jan 25, 2003

by Reene

Crush posted:

I'd prefer to not call in modules and just use pure Perl for this if possible. Please help out this Perl n00b!

This defeats the purpose of perl.

If you're uninterested in the deep set of extensively-tested libraries, there is pretty much no reason to use perl for software development.

It just becomes a sed/awk replacement in your cron jobs.

Ninja Rope
Oct 22, 2005

Wee.
In addition to that, many modules are pure perl.

raej
Sep 25, 2003

"Being drunk is the worst feeling of all. Except for all those other feelings."
I wrote a quick and dirty perl script that uses HTTP::Request and LWP::UserAgent to pick out pieces from a page and spit them out to a file.

Is there a way to tell these to click into links then scrape? The problem I'm facing is that I can't recursively go through a URL (something like site.com/page/1) to get to each page, but there are links to each page divided by letters.

The pages are all here so it would need to go to "0-9" then the first link in the <table><tr><td> and run the scraping part, go up one, click the second link and run the scraper again, etc.

Is there any easy way to do this, or should I go the python route?

raej fucked around with this message at 04:10 on Dec 14, 2012

Erasmus Darwin
Mar 6, 2001

raej posted:

Is there a way to tell these to click into links then scrape?

Two options come to mind:

1) Add HTML::TreeBuilder to the mix and use that to extract the links you need. Something like this:

code:
use LWP;
use HTML::TreeBuilder;

my $ua = LWP::UserAgent->new;
my $result = $ua->get('http://www.ratebeer.com/BrowseBrewers.asp');
die if ! $result->is_success;
my $tree = HTML::TreeBuilder->new_from_content($result->content);
my @brewer_letter_links = $tree->look_down(
    _tag => 'a',
    href => qr(^/browsebrewers)
);
print "Brewer letter links are: ";
print join(', ', map $_->attr('href'), @brewer_letter_links), "\n";
2) Just use WWW::Mechanize for everything, as it integrates both the HTTP stuff (via LWP::UserAgent) and the HTML parsing.

leedo
Nov 28, 2000

As another option there is Web::Scraper
code:
use v5.14;
use Web::Scraper;
use URI;

my $scraper = scraper {
  process 'a[href^="/browsebrewers"]', 'links[]' => '@href';
};

my $data = $scraper->scrape(URI->new("http://www.ratebeer.com/BrowseBrewers.asp"));
say join ", ", @{$data->{links}};

Filburt Shellbach
Nov 6, 2007

Apni tackat say tujay aaj mitta juu gaa!
HTML::LinkExtor

leedo
Nov 28, 2000

Too many ways to do it, I'm switching to python!

Mario Incandenza
Aug 24, 2000

Tell me, small fry, have you ever heard of the golden Triumph Forks?
Shoulda just used IO::Pty to run an instance of lynx and feed it some keystrokes, problem solved!

EVGA Longoria
Dec 25, 2005

Let's go exploring!

Mojo is the one I would use, since it's got both the UA stuff and DOM parsing built in.

The Gripper
Sep 14, 2004
i am winner
You could also probably just wget everything first then do your html scraping locally and not use any of your dumb scraper suggestions! Yeaaaahhhhh:
Bash code:
wget -r --no-parent -l1 -I "brewers" http://www.ratebeer.com/browsebrewers-{0-9,{A..Z}}.htm
(uses bash brace expansion)

Jonny 290
May 5, 2005



[ASK] me about OS/2 Warp

Mario Incandenza posted:

Shoulda just used IO::Pty to run an instance of lynx and feed it some keystrokes, problem solved!

You laugh, but I had to deploy a rudimentary status page last week by telnetting into a common jump server, telnetting into a specific site server, SSHing into the wireless controller | tee logfile.txt, spewing commands blindly, parsing the results via 'typed-in' one-liners, and FTPing the results back home for further chewing by the script this is all wrapped in.

Capturing output of Net::Telnet cmd() was failing, due to flaky network connections. We had to capture locally, then parse. And they removed Net::SSH years ago.

It was disgusting, sneaky, and made me feel bad as a Perl coder. But when your sole customer says "We need this by X date" and also says "You may not install any software, scripts or make any changes until X date + 6 weeks", you figure out workarounds.

Anaconda Rifle
Mar 23, 2007

Yam Slacker

Jonny 290 posted:

telnetting into a common jump server, telnetting into a specific site server, SSHing into the wireless controller | tee logfile.txt, spewing commands blindly, parsing the results via 'typed-in' one-liners, and FTPing the results back home for further chewing by the script this is all wrapped in.

rgoldberg.pl

raej
Sep 25, 2003

"Being drunk is the worst feeling of all. Except for all those other feelings."
Both those examples are awesome, but what I really need to do is crawl to those links, then to the links on each of those results, and on those pages extract certain portions of the page.

I've written most of the extraction part for each brewery's page, but it's the crawling part I'm having difficulty with.

Blotto Skorzany
Nov 7, 2008

He's a PSoC, loose and runnin'
came the whisper from each lip
And he's here to do some business with
the bad ADC on his chip
bad ADC on his chiiiiip

raej posted:

Both those examples are awesome, but what I really need to do is crawl to those links, then to the links on each of those results, and on those pages extract certain portions of the page.

I've written most of the extraction part for each brewery's page, but it's the crawling part I'm having difficulty with.

Are you just recursing on each link, maybe with a maximum recursion depth?

raej
Sep 25, 2003

"Being drunk is the worst feeling of all. Except for all those other feelings."
That's what I'm trying to figure out. From that starting point of http://www.ratebeer.com/BrowseBrewers.asp I'd need to go to each Alphabetic category, then each brewery listed. On each brewery's page is where I'd scrape the data.

code:
http://www.ratebeer.com/BrowseBrewers.asp
..0-9
....1. Hildener Landbierbrauerei (Run Scrape)
....10 Barrel (Run Scrape)
....101 North (Run Scrape)
....et cetera
..A
....A Tribbiera (Run Scrape)
....A. Duus and Co. (Run Scrape)
....et cetera
..B
..et cetera

The Gripper
Sep 14, 2004
i am winner

raej posted:

That's what I'm trying to figure out. From that starting point of http://www.ratebeer.com/BrowseBrewers.asp I'd need to go to each Alphabetic category, then each brewery listed. On each brewery's page is where I'd scrape the data.
Does that part need to be in Perl, or is it just preference? Despite my post looking like a joke, the brewers pages are structured in probably the best way for wget to deal with, since you can use wget -r -I "brewers" and have it only download the data you want, i.e. the browsebrewers-*.htm files and /brewers/<brewername>/<id>/index.htm files.

With that done you'll have all the data you need to deal with on disk, and you can just recurse through directories and run your extraction code on each index file.

raej
Sep 25, 2003

"Being drunk is the worst feeling of all. Except for all those other feelings."
That's not a bad idea at all. I tried running that with wget.exe, but I got an exception on the curly braces and I have no Linux box :-/

The Gripper
Sep 14, 2004
i am winner
Ah, I tested it out in the Cygwin terminal on Windows and it worked, so if that's something you're willing to install (base system + wget), it'll work for you.

It'll mean you then have Cygwin as a dependency for your project, which might not be something you want, though it shouldn't interfere with anything.

uG
Apr 23, 2003

by Ralp
code:
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

foreach my $first ('0-9','A'..'Z') {
	my $url = 'http://www.ratebeer.com/browsebrewers-' . $first . '.htm';
	$mech->get($url);
	
	foreach my $brewery ($mech->links) {
		if($brewery->url =~ m!/brewers/(?<brewery_name>.*?)/\d\d*/?$!) {
			print $+{'brewery_name'} . " = " . $brewery->url_abs . "\n";
		}
	}
}
This will get you the absolute link to each brewery.

tef
May 30, 2004

-> some l-system crap ->
Happy birthday perl :3:

prefect
Sep 11, 2001

No one, Woodhouse.
No one.




Dead Man’s Band

tef posted:

Happy birthday perl :3:

I love Perl, and screw all the computer scientists who look down their noses at me. :colbert:

Rohaq
Aug 11, 2006
So I wrote a script to feed AMQP feeds and raw log files into a Vertica database. It does this by parsing a text string, getting the relevant fields, building a new list of strings in comma separated format, then once it hits a set number of list items, blasts the list into a CSV file and makes a system call to the binary responsible to bulk load that file into the DB. Once that's confirmed as loaded, the list is cleared, and the process continues.

It functions in this roundabout way due to a number of limitations in the bulk load methods provided by Vertica - the method I use is the only way I can see to bulk load without requiring a database user with excessive privileges. The problem I've come across now is that the script seems to become very sluggish after processing a few million entries, with users reporting some oddly high CPU and memory usage.

My guess is that constantly filling and flushing the list of entries is filling up memory - I'm not sure how good Perl is with garbage collection, so perhaps when the list is 'cleared', the memory it was taking up isn't actually being freed or reused. Is there anything I can do to confirm this, and does anybody know a better method to use to avoid such problems?

Sang-
Nov 2, 2007

Rohaq posted:

So I wrote a script to feed AMQP feeds and raw log files into a Vertica database. It does this by parsing a text string, getting the relevant fields, building a new list of strings in comma separated format, then once it hits a set number of list items, blasts the list into a CSV file and makes a system call to the binary responsible to bulk load that file into the DB. Once that's confirmed as loaded, the list is cleared, and the process continues.

It functions in this roundabout way due to a number of limitations in the bulk load methods provided by Vertica - the method I use is the only way I can see to bulk load without requiring a database user with excessive privileges. The problem I've come across now is that the script seems to become very sluggish after processing a few million entries, with users reporting some oddly high CPU and memory usage.

My guess is that constantly filling and flushing the list of entries is filling up memory - I'm not sure how good Perl is with garbage collection, so perhaps when the list is 'cleared', the memory it was taking up isn't actually being freed or reused. Is there anything I can do to confirm this, and does anybody know a better method to use to avoid such problems?

perl's gc is "okay", but it can't handle cyclic references at all - so if you have an array containing a bunch of references, then add a reference to the array itself, perl will never be able to collect it.

High CPU usage doesn't really suggest that, though (from my experience at least); you might want to look into Devel::Gladiator and a few others.

Crumbles
Mar 25, 2010
I started learning Perl a few months ago for work, and I'd like to think I've gotten a pretty decent handle on things. It's been pretty fun so far - I've been picking things up as I code and have been going through Programming Perl at my leisure... One thing I'm not sure I totally get is subroutine prototypes. From my understanding, they're really only useful if you want to be able to call a subroutine without parens (like Perl's built-in functions) and don't really offer much else on top of that, like what you'd get out of, say, a method signature in Java. Now, the code I'm working with uses prototypes everywhere. Am I not seeing some other benefit to using them?

het
Nov 14, 2002

A dark black past
is my most valued
possession

Crumbles posted:

I started learning Perl a few months ago for work, and I'd like to think I've gotten a pretty decent handle on things. It's been pretty fun so far - I've been picking things up as I code and have been going through Programming Perl at my leisure... One thing I'm not sure I totally get is subroutine prototypes. From my understanding, they're really only useful if you want to be able to call a subroutine without parens (like Perl's built-in functions) and don't really offer much else on top of that, like what you'd get out of, say, a method signature in Java. Now, the code I'm working with uses prototypes everywhere. Am I not seeing some other benefit to using them?
There are other benefits, like transparently passing an array by reference so you can change it a la push(), but you can be virtually guaranteed that anyone who uses perl function prototypes constantly has no idea what they're for and shouldn't be using them.

This reminds me of some code that I was asked to maintain when I got my first real job. My coworker who wrote the code was a novice programmer and not too familiar with perl, so she wrote something like this:
Perl code:
my $foo = '';
my $bar = '';
my $baz = '';

sub frob($$$) {
  $foo = 'flibbertygibbit';
  $bar = 'dfsjkfadjskl';
  $baz = 'sajdkffd';
}

frob($foo, $bar, $baz);

You can sorrrrt of see what she was thinking with the prototypes, but if you follow it to its logical conclusion, it's pretty hilariously crazy.

leedo
Nov 28, 2000

het posted:

There are other benefits, like transparently passing an array by reference so you can change it a la push(), but you can be virtually guaranteed that anyone who uses perl function prototypes constantly has no idea what they're for and shouldn't be using them.

One cool use for prototypes is passing blocks in without the need for sub.

e.g.

code:
#!/usr/bin/env perl

use v5.14;

sub mapp(&@) {
  my ($f, @a) = @_;
  my @col;
  for (@a) {
    push @col, $f->($_);
  }
  return @col;
}

my @plus_one = mapp { $_ + 1} (0,1,2);

say join ", ", @plus_one;
