  • Locked thread
Avenging Dentist
Oct 1, 2005

oh my god is that a circular saw that does not go in my mouth aaaaagh

BeefofAges posted:

Yeah, that's what I said to do, but the guy who asked me was like "I'm doing this in a lot of places, so it would be nice to make it compact". Heh.

So use shorter variable names. Or use a semicolon to put it on one physical line.

hey mom its 420
May 12, 2007

Python tries to avoid this kind of stuff because of the whole readability thing and that there should be only one way to write something. I guess you could hack something together by accessing the global dictionary, but blehh.

Lurchington
Jan 2, 2003

Forums Dragoon

tef posted:


edit: you said you were using lxml.html? can you paste the error message?

Scaevolus posted:

It works fine for me.


I'll try it again in the next day or so. I was using lxml.html and I was getting a SerialisationError when I tried to do a toString() on anything. If I tried to mess with specifying a unicode encoding, it gave me a Unicode error on some random character saying the ordinal was out of range(128).

This was using 2.2.6 lxml and 2.6.2 active python on windows XP.

ATLbeer
Sep 26, 2004
Über nerd

BeefofAges posted:

Yeah, that's what I said to do, but the guy who asked me was like "I'm doing this in a lot of places, so it would be nice to make it compact". Heh.

Doing something in a lot of places eh? Repeating yourself? If only there was a function?

make_string_and_append("make", "append")

Haha... but no, clarity is always more important than cleverness. Just set the string and append.

tef
May 30, 2004

-> some l-system crap ->

Lurchington posted:

I'll try it again in the next day or so. I was using lxml.html and I was getting a SerialisationError when I tried to do a toString() on anything. If I tried to mess with specifying a unicode encoding, it gave me a Unicode error on some random character saying the ordinal was out of range(128).

This was using 2.2.6 lxml and 2.6.2 active python on windows XP.

try etree.tounicode(foo) ? can you pastebin a small example showing the error?

TOO SCSI FOR MY CAT
Oct 12, 2008

this is what happens when you take UI design away from engineers and give it to a bunch of hipster art student "designers"

Dijkstracula posted:

I tried Beautiful Soup this afternoon but it broke -- that is, threw some exception deep in its bowels -- on a reasonably trivial page.

lxml looks like a serious pain in that it has to be installed rather than my just dropping it in my working directory, and I'm working across OSs and need it to be basically self-contained...this is really preposterous, so I gather there's really no good solution for a reasonably simple thing like this? It's bullshit like this that causes people to just regex the poo poo out of HTML instead of using proper parsers.

edit: here's my team's torture test site. If you can find something that can make sense of it, let me know :)

You could try html5lib, which uses more-or-less the same handling for lovely pages that modern browsers do. Your test page is currently timing out for me, but html5lib works with like a million possible parser implementations and there's no way your page is so hosed up that it'll fail to parse.

Lurchington
Jan 2, 2003

Forums Dragoon

tef posted:

try etree.tounicode(foo) ? can you pastebin a small example showing the error?

I could be misunderstanding, but I'm not using (I assume you mean the etree that comes bundled with python) etree, this is with the lxml install from http://codespeak.net/lxml/installation.html#ms-windows
If etree does html I'll use it for sure.

edit: something must be hosed in my install or I'm unfortunately glossing over important details. I've only used xml parts of etree before and I had thought I was relatively comfortable with the api.

I tried to make it as barebones as possible:
http://www.pastebin.org/100314
code:
if __name__ == "__main__":

    page = lxml.html.parse("http://www.usccb.org/nab/032110.shtml")
    print lxml.html.tostring(page.getroot())
and I get (PyScripter alt+f9 external run command):
code:
Commandline: C:\Python26\python.exe D:\DEVELO~1\wagen\wagen.py
Workingdirectory: D:\Development\wagen
Timeout: 0 ms

Traceback (most recent call last):
  File "D:\DEVELO~1\wagen\wagen.py", line 22, in <module>
    page = lxml.html.parse("http://www.usccb.org/nab/032110.shtml")
  File "C:\Python26\lib\site-packages\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64060)
IOError: Error reading file 'http://www.usccb.org/nab/032110.shtml': failed to load external entity "http://www.usccb.org/nab/032110.shtml"

Process "Pyhton Interpreter" terminated, ExitCode: 00000001

Lurchington fucked around with this message at 05:31 on Mar 3, 2010

Jonnty
Aug 2, 2007

The enemy has become a flaming star!

Lurchington posted:

I could be misunderstanding, but I'm not using (I assume you mean the etree that comes bundled with python) etree, this is with the lxml install from http://codespeak.net/lxml/installation.html#ms-windows
If etree does html I'll use it for sure.

edit: something must be hosed in my install or I'm unfortunately glossing over important details. I've only used xml parts of etree before and I had thought I was relatively comfortable with the api.

I tried to make it as barebones as possible:
http://www.pastebin.org/100314
code:
if __name__ == "__main__":

    page = lxml.html.parse("http://www.usccb.org/nab/032110.shtml")
    print lxml.html.tostring(page.getroot())
and I get (PyScripter alt+f9 external run command):
code:
Commandline: C:\Python26\python.exe D:\DEVELO~1\wagen\wagen.py
Workingdirectory: D:\Development\wagen
Timeout: 0 ms

Traceback (most recent call last):
  File "D:\DEVELO~1\wagen\wagen.py", line 22, in <module>
    page = lxml.html.parse("http://www.usccb.org/nab/032110.shtml")
  File "C:\Python26\lib\site-packages\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64060)
IOError: Error reading file 'http://www.usccb.org/nab/032110.shtml': failed to load external entity "http://www.usccb.org/nab/032110.shtml"

Process "Pyhton Interpreter" terminated, ExitCode: 00000001

Just, you know, sticking my neck out here but I'm guessing it can't load the page for whatever reason. Also "Pyhton Interpreter"?

tef
May 30, 2004

-> some l-system crap ->

Lurchington posted:

I could be misunderstanding, but I'm not using (I assume you mean the etree that comes bundled with python) etree, this is with the lxml install from http://codespeak.net/lxml/installation.html#ms-windows
If etree does html I'll use it for sure.

as in lxml.etree.tounicode(...)

quote:

edit: something must be hosed in my install or I'm unfortunately glossing over important details. I've only used xml parts of etree before and I had thought I was relatively comfortable with the api.

try page = lxml.etree.parse("http://cocks/", lxml.etree.HTMLParser()) instead. your example doesn't work on my installation (no html module in lxml...).

(also, lxml is good for parsing xml, but you might want to try something like curl or urllib2 for fetching it.)

Lurchington
Jan 2, 2003

Forums Dragoon

Jonnty posted:

Also "Pyhton Interpreter"?

PyScripter is definitely my favorite windows IDE, but there's one or two idiosyncrasies there. That's what's printed with an external run specified (alt+F9)

And the link certainly could be that it isn't loading (in my original implementation, I did use urllib2 to get the url, which was successfully opened), but I was just trying to repeat Scaevolus's successful attempt with the same syntax.

I'll try the different parse option and the tounicode parts this afternoon, thanks for the suggestions.

the wizards beard
Apr 15, 2007
Reppin

4 LIFE 4 REAL
e: nvm, this is serious overkill

the wizards beard fucked around with this message at 17:25 on Mar 3, 2010

Jonnty
Aug 2, 2007

The enemy has become a flaming star!

Lurchington posted:

PyScripter is definitely my favorite windows IDE, but there's one or two idiosyncrasies there. That's what's printed with an external run specified (alt+F9)

And the link certainly could be that it isn't loading (in my original implementation, I did use urllib2 to get the url, which was successfully opened), but I was just trying to repeat Scaevolus's successful attempt with the same syntax.

I'll try the different parse option and the tounicode parts this afternoon, thanks for the suggestions.

For future reference, though, do make an attempt to understand error messages - don't just go 'welp, program guts are all over my screen, time to panic'. This:

code:
IOError: Error reading file 'http://www.usccb.org/nab/032110.shtml': failed to load external entity "http://www.usccb.org/nab/032110.shtml"
isn't horrendously cryptic.

Lurchington
Jan 2, 2003

Forums Dragoon
The point I was making was that I used the same code as the previous poster, but mine errored. While lxml clearly couldn't open the link, and had a relatively clear error message, I don't think it's a stretch to consider that something else was the actual root cause.

If you simply wanted an opportunity to say "don't freak out and actually read error messages," fine, I'm right there with you.

Lurchington fucked around with this message at 18:12 on Mar 3, 2010

Lurchington
Jan 2, 2003

Forums Dragoon
Alright, I'm fine to not talk about this anymore since I gave up on the original project and I don't want to derail the thread too bad, but here's where I'm at:

Scaevolus posted:

It works fine for me.
code:
>>> from lxml import html
>>> h = html.parse("http://www.usccb.org/nab/032110.shtml")
>>> print html.tostring(h.getroot())
<html><head><title>USCCB - (NAB) - March 21, 2010</title>

[[REDACTED]]

<!-- comment out no need for this as months are all over the left margin
        <div class="menutitle" onclick="SwitchMenu('sub3')">Readings and Psalms for the Month</a></div>
        <span class="submenu" id="sub3">


ok, I'm at a separate computer that has lxml 2.2.2 installed and got:
code:
Python 2.6.2 (r262:71600, Apr 21 2009, 15:05:37) [MSC v.1500 32 bit (Intel)] on win32
>>> from lxml import html
>>> h = html.parse("http://www.usccb.org/nab/032110.shtml")
>>> print html.tostring(h.getroot())

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    print html.tostring(h.getroot())
  File "C:\Python26\lib\site-packages\lxml\html\__init__.py", line 1426, in tostring
    encoding=encoding)
  File "lxml.etree.pyx", line 2630, in lxml.etree.tostring (src/lxml/lxml.etree.c:49093)
  File "serializer.pxi", line 124, in lxml.etree._tostring (src/lxml/lxml.etree.c:78704)
  File "serializer.pxi", line 149, in lxml.etree._raiseSerialisationError (src/lxml/lxml.etree.c:78963)
SerialisationError: IO_ENCODER
>>> 
I then uninstalled lxml 2.2.2, installed 2.2.4 using lxml-2.2.4.win32-py2.6.exe, and got the following two errors:

(original form used by Scaev)
code:
>>> from lxml import html
>>> h = html.parse("http://www.usccb.org/nab/032110.shtml")

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    h = html.parse("http://www.usccb.org/nab/032110.shtml")
  File "C:\Python26\lib\site-packages\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64060)
IOError: Error reading file 'http://www.usccb.org/nab/032110.shtml': failed to load external entity "http://www.usccb.org/nab/032110.shtml"
(and after putting urllib2.urlopen in the loop)
code:
>>> from lxml import html
>>> import urllib2
>>> h = html.parse(urllib2.urlopen("http://www.usccb.org/nab/032110.shtml"))
>>> print html.tostring(h.getroot())

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print html.tostring(h.getroot())
  File "C:\Python26\lib\site-packages\lxml\html\__init__.py", line 1442, in tostring
    encoding=encoding)
  File "lxml.etree.pyx", line 2624, in lxml.etree.tostring (src/lxml/lxml.etree.c:49097)
  File "serializer.pxi", line 124, in lxml.etree._tostring (src/lxml/lxml.etree.c:78863)
  File "serializer.pxi", line 149, in lxml.etree._raiseSerialisationError (src/lxml/lxml.etree.c:79122)
SerialisationError: IO_ENCODER
for fun, I did a read() on the urlopen object and it looks fine. I also tried lxml.html.fromstring(urllib2.urlopen(url).read()) and got the same SerialisationError.

and to answer tef: using
code:
import lxml
import urllib2
from lxml import html
h = html.parse(urllib2.urlopen("http://www.usccb.org/nab/032110.shtml"))
print lxml.etree.tounicode(h.getroot())
yielded:
code:
Traceback (most recent call last):
  File "C:\Python26\m1.py", line 5, in <module>
    print lxml.etree.tounicode(h.getroot())
  File "lxml.etree.pyx", line 2670, in lxml.etree.tounicode (src/lxml/lxml.etree.c:49402)
  File "serializer.pxi", line 128, in lxml.etree._tostring (src/lxml/lxml.etree.c:78896)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x95 in position 6303: unexpected code byte

Lurchington fucked around with this message at 18:54 on Mar 3, 2010

Stabby McDamage
Dec 11, 2005

Doctor Rope
Okay, here's something dead simple I can't find an elegant way to do.

I want to set a slice of a list to a single constant. My naive guess:

list[0:15] = 100

No dice. Is there a way to do this without a loop or constructing a list of the same constant repeated N times for the sole purpose of pairing to the list slice?

EDIT: In case it's not clear, I'm thinking of something analogous to memset() in C.

MaberMK
Feb 1, 2008

BFFs

Stabby McDamage posted:

Okay, here's something dead simple I can't find an elegant way to do.

I want to set a slice of a list to a single constant. My naive guess:

list[0:15] = 100

No dice. Is there a way to do this without a loop or constructing a list of the same constant repeated N times for the sole purpose of pairing to the list slice?

EDIT: In case it's not clear, I'm thinking of something analogous to memset() in C.

code:
list[0:15] = [100] * 15
e: Actually, since you may want to generalize that to an arbitrary range, this would be more robust:

code:
list[x:y] = [value] * len(list[x:y])

MaberMK fucked around with this message at 19:43 on Mar 3, 2010

No Safe Word
Feb 26, 2005

MaberMK posted:

code:
list[0:15] = [100] * 15
e: Actually, since you may want to generalize that to an arbitrary range, this would be more robust:

code:
list[x:y] = [value] * len(list[x:y])

Or just list[x:y] = [value] * (y-x) :v:
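
A minimal sketch of the two spellings above, written for Python 3 even though the thread is on Python 2. The helper name `fill_slice` and the sample data are made up for illustration. One difference to be aware of: `len(lst[x:y])` stays correct when `y` runs past the end of the list, while `y - x` assumes both bounds are in range and will grow the list otherwise.

```python
def fill_slice(lst, x, y, value):
    """Overwrite lst[x:y] in place with `value` (hypothetical helper)."""
    count = len(lst[x:y])          # MaberMK's form: copes with y past the end
    lst[x:y] = [value] * count     # still builds a temporary list of `count` items
    return lst


data = list(range(10))
fill_slice(data, 2, 5, 100)
print(data)        # [0, 1, 100, 100, 100, 5, 6, 7, 8, 9]

short = [0, 1, 2]
short[1:10] = [9] * (10 - 1)       # y - x overshoots, so the list grows
print(len(short))  # 10
```

If the bounds are known to be valid, the shorter `(y - x)` form gives the same result without copying the slice first.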

MaberMK
Feb 1, 2008

BFFs

No Safe Word posted:

Or just list[x:y] = [value] * (y-x) :v:

hurrr, color me retarded. Stabby, do it this way.

Stabby McDamage
Dec 11, 2005

Doctor Rope

No Safe Word posted:

Or just list[x:y] = [value] * (y-x) :v:

Isn't that going to create a scratch list that's y-x entries long? Or is there some magic there I don't see? It doesn't matter for what I was doing, but it seems really inefficient for large values of y-x. I ended up doing:

for i in range(x,y): a[i] = value



Anyway, here's my next question. Multiple constructors: how do I make them elegant?

Right now I have my __init__ be the root case that the user will never directly call. Then I have various @classmethod-decorated functions that call the constructor, then diddle with the new object before finally returning it. Is this the right way to do it?
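
A rough sketch of the pattern described above: one plain `__init__` plus `@classmethod` alternative constructors. The `Rect` class and its method names are invented for illustration, not taken from anyone's code in the thread.

```python
class Rect:
    def __init__(self, left, top, right, bottom):
        # The "root" constructor: stores the canonical representation only.
        self.left, self.top, self.right, self.bottom = left, top, right, bottom

    @classmethod
    def from_size(cls, left, top, width, height):
        # Build from an origin plus a size instead of two corners.
        return cls(left, top, left + width, top + height)

    @classmethod
    def from_string(cls, text):
        # Parse "left,top,right,bottom" into four integers.
        return cls(*(int(part) for part in text.split(",")))


r = Rect.from_size(1, 2, 3, 4)
print(r.right, r.bottom)   # 4 6
```

Calling `cls(...)` rather than `Rect(...)` means subclasses inherit the alternative constructors and get instances of their own type back.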

Avenging Dentist
Oct 1, 2005

oh my god is that a circular saw that does not go in my mouth aaaaagh

Stabby McDamage posted:

Isn't that going to create a scratch list that's y-x entries long? Or is there some magic there I don't see? It doesn't matter for what I was doing, but it seems really inefficient for large values of y-x.

If you're worried about efficiency, why are you using a list?

Stabby McDamage
Dec 11, 2005

Doctor Rope
I just tested it, and it does make a y-x list and then throw it away. What's worse is that my for loop uses even more memory and runs even slower! I think the range() in my for loop is literally making a list in memory -- I thought it was supposed to be a generator?

To reproduce:
code:
>>> n = 100000000
>>> a = [1] * n
(memory usage goes to 384MB)
>>> m = 90000000
>>> a[0:m] = [2]*m
(memory briefly blips up to 752MB, then falls back to 384MB)
>>> for i in range(0,m): a[i]=2
(memory goes through the roof and starts thrashing until I interrupt it)
Python 2.5.2, for the record.

Stabby McDamage
Dec 11, 2005

Doctor Rope

Avenging Dentist posted:

If you're worried about efficiency, why are you using a list?

What else other than a list would I use for array-like functionality in Python?

Or are you asking "if you care about large arrays, why are you writing Python?"

Avenging Dentist
Oct 1, 2005

oh my god is that a circular saw that does not go in my mouth aaaaagh

Stabby McDamage posted:

What else other than a list would I use for array-like functionality in Python?

NumPy. Which incidentally does what you tried to do in the beginning.

Also in Python 2.x, range returns a list. You want xrange.
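
For readers on Python 3: there `range` is lazy, much like Python 2's `xrange`, so the loop from earlier no longer builds a list of indices. A small sketch of the difference follows; exact byte counts vary by platform, so none are shown.

```python
import sys

n = 1000000
lazy = range(n)          # Python 3: small fixed-size object, like Python 2's xrange
eager = list(range(n))   # roughly what Python 2's range(n) built up front

print(sys.getsizeof(lazy))    # does not depend on n
print(sys.getsizeof(eager))   # grows with n (and this counts only the pointer array)

a = [1] * n
for i in range(0, n // 2):    # no temporary list of indices is created
    a[i] = 2
```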

taqueso
Mar 8, 2004


:911:
:wookie: :thermidor: :wookie:
:dehumanize:

:pirate::hf::tinfoil:

Won't replacing elements in a loop cause python to traverse the list y-x times, pretty much negating any benefit of not creating a scratch list?

Avenging Dentist
Oct 1, 2005

oh my god is that a circular saw that does not go in my mouth aaaaagh

taqueso posted:

Won't replacing elements in a loop cause python to traverse the list y-x times, pretty much negating any benefit of not creating a scratch list?

What?

taqueso
Mar 8, 2004


:911:
:wookie: :thermidor: :wookie:
:dehumanize:

:pirate::hf::tinfoil:

Avenging Dentist posted:

What?

Maybe cpython optimizes this, but I would think

code:
for i in range(x,y): a[i] = value
will result in:
1. traverse the list to get to element a[x]
2. replace a[x] with value
3. increment x and go to 1

But
code:
Or just list[x:y] = [value] * (y-x)
will only need to traverse the list once.

I am not a python expert by any stretch of the imagination and I would love to find out that python is smart enough to remember the previous element in the for loop.

Avenging Dentist
Oct 1, 2005

oh my god is that a circular saw that does not go in my mouth aaaaagh
What are you talking about? Why would Python need to traverse a list to get to an integer offset in an array?

taqueso
Mar 8, 2004


:911:
:wookie: :thermidor: :wookie:
:dehumanize:

:pirate::hf::tinfoil:

Avenging Dentist posted:

What are you talking about? Why would Python need to traverse a list to get to an integer offset in an array?

You are right. For some reason I thought that a linked list is used to represent the list datatype, but as you say it is an array and an element can be found in constant time.

Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe
Why the hell would anyone implement the [] operator if the underlying implementation was a linked list! :psyduck:

Scaevolus
Apr 16, 2007

Lurchington posted:

Alright, I'm fine to not talk about this anymore since I gave up on the original project and I don't want to derail the thread too bad, but here's where I'm at:

code:
from lxml import html
h = html.parse("http://www.usccb.org/nab/032110.shtml")
print html.tostring(h, encoding='utf-8')

tripwire
Nov 19, 2004

        ghost flow
Lurchington, I think whatever activepython or pyshell fanciness you are using is making this harder for you by obfuscating the error messages a little.

I assume that what should be UnicodeDecodeError exceptions are instead getting translated to those serialization errors for you by whatever pyshell or activepython stuff you are using.

It's very important to keep in mind that a windows console will ALWAYS give you unicode encoding errors if you try to print out unicode characters on it, because the native character encoding of the windows command interpreter is ASCII!

That's why you are getting that error doing what Tef suggested.

ErIog
Jul 11, 2001

:nsacloud:

tripwire posted:

Lurchington, I think whatever activepython or pyshell fanciness you are using is making this harder for you by obfuscating the error messages a little.

I assume that what should be UnicodeDecodeError exceptions are instead getting translated to those serialization errors for you by whatever pyshell or activepython stuff you are using.

It's very important to keep in mind that a windows console will ALWAYS give you unicode encoding errors if you try to print out unicode characters on it, because the native character encoding of the windows command interpreter is ASCII!

That's why you are getting that error doing what Tef suggested.

I will attest to this. Unicode always makes everything more complicated, but it's all soluble if you remember this when dealing with Unicode. I'm on a project now that requires parsing a bunch of Unicode XML, and I've had more trouble with my debug reporting than the real problem solving.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

ErIog posted:

I will attest to this. Unicode always makes everything more complicated, but it's all soluble if you remember this when dealing with Unicode. I'm on a project now that requires parsing a bunch of Unicode XML, and I've had more trouble with my debug reporting than the real problem solving.

Agreed. I'm writing some stuff now that takes a bunch of unicode from web services and debugging on windows is bullshit because of the ASCII console.
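
One way to keep debug output from blowing up, sketched in Python 3 and assuming (as the posts above do) a console that only copes with ASCII: escape anything the console can't show before printing it. The helper name `debug_repr` is hypothetical.

```python
def debug_repr(text, encoding="ascii"):
    """Return `text` with unencodable characters shown as backslash escapes."""
    # "backslashreplace" never raises; it swaps in \xNN or \uNNNN escapes instead.
    return text.encode(encoding, "backslashreplace").decode(encoding)


print(debug_repr("caf\u00e9 \u2022 na\u00efve"))   # caf\xe9 \u2022 na\xefve
```

The output is lossy for display purposes only; the original string is untouched, so the real processing can keep working on proper Unicode.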

Lurchington
Jan 2, 2003

Forums Dragoon

tripwire posted:

Lurchington, I think whatever activepython or pyshell fanciness you are using is making this harder for you by obfuscating the error messages a little.

That's likely, I have a Mac and Linux test box around here that I can use, but all my windows machines are using ActivePython.


Scaevolus posted:

code:
from lxml import html
h = html.parse("http://www.usccb.org/nab/032110.shtml")
print html.tostring(h, encoding='utf-8')

Just using the URL gives "failed to load external entity", but using urllib2.urlopen on the URL with the explicit encoding does seem to provide good results.

Upon inspection, there's like 80 of these lines:
code:
<td> </td>
Thanks everyone, but this was frustrating because I had thought I had already learned my tough unicode lessons thanks to a lot of work with XML/ElementTree stuff. I guess my takeaway here may be more on the Python middleware side of things.
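
A guess at what the `0x95` byte in the earlier `UnicodeDecodeError` might be, sketched in Python 3. The assumption (not confirmed anywhere in the thread) is that the page is really Windows-1252, where `0x95` is a bullet character; the `raw` fragment below is made up to show the behaviour.

```python
raw = b"<td>\x95</td>"   # made-up fragment containing the byte from the traceback

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x95 can only be a continuation byte in UTF-8, so it is invalid here.
    print("utf-8 failed:", exc.reason)

text = raw.decode("cp1252")   # assumption: the page is actually Windows-1252
print(ascii(text))            # '<td>\u2022</td>'
```

If that assumption holds, decoding the fetched bytes with the right codec before handing them to the parser would sidestep the error.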

Stabby McDamage
Dec 11, 2005

Doctor Rope

Avenging Dentist posted:

NumPy. Which incidentally does what you tried to do in the beginning.

Also in Python 2.x, range returns a list. You want xrange.

I'll keep that in mind for the future. I assume Numpy is written in C on the backend?

Regarding xrange(): why does range() exist given the existence of xrange()? Are there situations where the generator nature of it would be a problem?

Also, there's something else I'm curious about. Why did my for loop take so much more memory than the list itself? It seems like range(0,90e6) should be a bit smaller than [1] * 100e6, but instead my for loop ate all the RAM available (~1.5GB) and started thrashing. Weird.

Avenging Dentist
Oct 1, 2005

oh my god is that a circular saw that does not go in my mouth aaaaagh

Stabby McDamage posted:

Also, there's something else I'm curious about. Why did my for loop take so much more memory than the list itself? It seems like range(0,90e6) should be a bit smaller than [1] * 100e6, but instead my for loop ate all the RAM available (~1.5GB) and started thrashing. Weird.

Because [x]*N creates a list with N references to x (i.e. x[i] is x[j] == True for all i,j in [0,N)), whereas range(N) creates a list with N distinct integers, all* of which are heap-allocated as separate objects. It should be obvious why the former takes less space than the latter.

Basically a Python list is an array of pointers to PyObjects, so for an N-element list on a 32-bit system, there are 4*N bytes of data taken up by that (plus 12 bytes for the refcount, pointer to type object, and length). Each integer object in Python takes 12 bytes of data (refcount, pointer to type object, and value), so a list of N distinct Python ints costs 16*N+12 bytes of data, excluding malloc overhead. A list of N identical Python ints costs 4*N+24 bytes.

* However, integers in the range [-5, 256] are cached by Python and don't involve an allocation.
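
The shared-versus-distinct point can be checked directly. A small sketch in Python 3, with values chosen to sit outside the cached small-integer range:

```python
n = 1000
shared = [12345] * n                        # n references to one int object
distinct = list(range(10000, 10000 + n))    # n separate int objects

print(len({id(v) for v in shared}))         # 1
print(len({id(v) for v in distinct}))       # 1000
print(all(v is shared[0] for v in shared))  # True
```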

Avenging Dentist fucked around with this message at 06:44 on Mar 4, 2010

nbv4
Aug 21, 2002

by Duchess Gummybuns

Stabby McDamage posted:

Right now I have my __init__ be the root case that the user will never directly call. Then I have various @classmethod-decorated functions that call the constructor, then diddle with the new object before finally returning it. Is this the right way to do it?

I'm not a python expert, but that's exactly how I do it.

Stabby McDamage
Dec 11, 2005

Doctor Rope

Avenging Dentist posted:

Because [x]*N creates a list with N references to x (i.e. x[i] is x[j] == True for all i,j in [0,N)), whereas range(N) creates a list with N distinct integers, all* of which are heap-allocated as separate objects. It should be obvious why the former takes less space than the latter.

Basically a Python list is an array of pointers to PyObjects, so for an N-element list on a 32-bit system, there are 4*N bytes of data taken up by that (plus 12 bytes for the refcount, pointer to type object, and length). Each integer object in Python takes 12 bytes of data (refcount, pointer to type object, and value), so a list of N distinct Python ints costs 16*N+12 bytes of data, excluding malloc overhead. A list of N identical Python ints costs 4*N+24 bytes.

* However, integers in the range [-5, 256] are cached by Python and don't involve an allocation.

Individual integers are heap-allocated full python objects, with a refcount and everything? Gross. I mean, I see why you'd want to implement it that way, but drat.

BigRedDot
Mar 6, 2008

Stabby McDamage posted:

I assume Numpy is written in C on the backend?
Yes, it is (I used to work with the guy who wrote it).
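
For reference, a minimal sketch of the NumPy behaviour mentioned earlier, where assigning a scalar to a slice fills it in place. It assumes NumPy is installed; the array size and dtype are arbitrary.

```python
import numpy as np

a = np.ones(20, dtype=np.int64)
a[0:15] = 100     # the scalar is broadcast across the slice, memset-style
print(a)
```

No temporary Python list of 15 values is built here, which is what the original `list[0:15] = 100` attempt was after.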

Scaevolus
Apr 16, 2007

I ran into this bug yesterday.

code:
class LeakyDict(dict):
    def __init__(self):
        dict.__init__(self)
        self.__dict__ = self


while True:
    a = LeakyDict()
I'm annoyed that they've had a patch to fix it for 3+ years and haven't committed it yet.

Scaevolus fucked around with this message at 00:54 on Mar 6, 2010
