Python information and short questions megathread.

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python information and short questions megathread.


Avenging Dentist: Oct 1, 2005; oh my god is that a circular saw that does not go in my mouth aaaaagh

BeefofAges posted:

Yeah, that's what I said to do, but the guy who asked me was like "I'm doing this in a lot of places, so it would be nice to make it compact". Heh.

So use shorter variable names. Or use a semicolon to put it on one physical line.

# ? Mar 2, 2010 21:53

hey mom its 420: May 12, 2007

Python tries to avoid this kind of stuff because of the whole readability thing and that there should be only one way to write something. I guess you could hack something together by accessing the global dictionary, but blehh.

# ? Mar 2, 2010 21:54

Lurchington: Jan 2, 2003; Forums Dragoon

tef posted:

edit: you said you were using lxml.html? can you paste the error message?

Scaevolus posted:

It works fine for me.

I'll try it again in the next day or so. I was using lxml.html and I was getting a SerialisationError when I tried to do a toString() on anything. If I tried to mess with specifying a unicode encoding, it gave me a Unicode error on some random character saying the ordinal was out of range(128).

This was using 2.2.6 lxml and 2.6.2 active python on windows XP.

# ? Mar 2, 2010 22:05

ATLbeer: Sep 26, 2004; Über nerd

BeefofAges posted:

Yeah, that's what I said to do, but the guy who asked me was like "I'm doing this in a lot of places, so it would be nice to make it compact". Heh.

Doing something in a lot of places eh? Repeating yourself? If only there was a function?

make_string_and_append("make", "append")

Haha... but no, clarity is always more important than cleverness. Just set the string and append.

# ? Mar 2, 2010 23:37

tef: May 30, 2004; -> some l-system crap ->

Lurchington posted:

I'll try it again in the next day or so. I was using lxml.html and I was getting a SerialisationError when I tried to do a toString() on anything. If I tried to mess with specifying a unicode encoding, it gave me a Unicode error on some random character saying the ordinal was out of range(128).

This was using 2.2.6 lxml and 2.6.2 active python on windows XP.

try etree.tounicode(foo) ? can you pastebin a small example showing the error?

# ? Mar 3, 2010 01:44

TOO SCSI FOR MY CAT: Oct 12, 2008; this is what happens when you take UI design away from engineers and give it to a bunch of hipster art student "designers"

Dijkstracula posted:

I tried Beautiful Soup this afternoon but it broke -- that is, threw some exception deep in its bowels -- on a reasonably trivial page.

lxml looks like a serious pain in that it has to be installed rather than my just dropping it in my working directory, and I'm working across OSs and need it to be basically self-contained...this is really preposterous, so I gather there's really no good solution for a reasonably simple thing like this? It's bullshit like this that causes people to just regex the poo poo out of HTML instead of using proper parsers.

edit: here's my team's torture test site. If you can find something that can make sense of it, let me know

You could try html5lib, which uses more-or-less the same handling for lovely pages that modern browsers do. Your test page is currently timing out for me, but html5lib works with like a million possible parser implementations and there's no way your page is so hosed up that it'll fail to parse.

# ? Mar 3, 2010 04:30

Lurchington: Jan 2, 2003; Forums Dragoon

tef posted:

try etree.tounicode(foo) ? can you pastebin a small example showing the error?

I could be misunderstanding, but I'm not using etree (I assume you mean the etree that comes bundled with python); this is with the lxml install from http://codespeak.net/lxml/installation.html#ms-windows
If etree does html I'll use it for sure.

edit: something must be hosed in my install or I'm unfortunately glossing over important details. I've only used xml parts of etree before and I had thought I was relatively comfortable with the api.

I tried to make it as barebones as possible:
http://www.pastebin.org/100314

code:

if __name__ == "__main__":

    page = lxml.html.parse("http://www.usccb.org/nab/032110.shtml")
    print lxml.html.tostring(page.getroot())

and I get (PyScripter alt+f9 external run command):

code:

Commandline: C:\Python26\python.exe D:\DEVELO~1\wagen\wagen.py
Workingdirectory: D:\Development\wagen
Timeout: 0 ms

Traceback (most recent call last):
  File "D:\DEVELO~1\wagen\wagen.py", line 22, in <module>
    page = lxml.html.parse("http://www.usccb.org/nab/032110.shtml")
  File "C:\Python26\lib\site-packages\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64060)
IOError: Error reading file 'http://www.usccb.org/nab/032110.shtml': failed to load external entity "http://www.usccb.org/nab/032110.shtml"

Process "Pyhton Interpreter" terminated, ExitCode: 00000001

Lurchington fucked around with this message at 05:31 on Mar 3, 2010

# ? Mar 3, 2010 05:17

Jonnty: Aug 2, 2007; The enemy has become a flaming star!

Lurchington posted:

code:

if __name__ == "__main__":

    page = lxml.html.parse("http://www.usccb.org/nab/032110.shtml")
    print lxml.html.tostring(page.getroot())

and I get (PyScripter alt+f9 external run command):

code:

Commandline: C:\Python26\python.exe D:\DEVELO~1\wagen\wagen.py
Workingdirectory: D:\Development\wagen
Timeout: 0 ms

Traceback (most recent call last):
  File "D:\DEVELO~1\wagen\wagen.py", line 22, in <module>
    page = lxml.html.parse("http://www.usccb.org/nab/032110.shtml")
  File "C:\Python26\lib\site-packages\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64060)
IOError: Error reading file 'http://www.usccb.org/nab/032110.shtml': failed to load external entity "http://www.usccb.org/nab/032110.shtml"

Process "Pyhton Interpreter" terminated, ExitCode: 00000001

Just, you know, sticking my neck out here but I'm guessing it can't load the page for whatever reason. Also "Pyhton Interpreter"?

# ? Mar 3, 2010 10:25

tef: May 30, 2004; -> some l-system crap ->

Lurchington posted:

I could be misunderstanding, but I'm not using etree (I assume you mean the etree that comes bundled with python); this is with the lxml install from http://codespeak.net/lxml/installation.html#ms-windows
If etree does html I'll use it for sure.

as in lxml.etree.tounicode(...)

quote:

edit: something must be hosed in my install or I'm unfortunately glossing over important details. I've only used xml parts of etree before and I had thought I was relatively comfortable with the api.

try page = lxml.etree.parse("http://cocks/", lxml.etree.HTMLParser()) instead. your example doesn't work on my installation (no html module in lxml...).

(also, lxml is good for parsing xml, but you might want to try something like curl or urllib2 for fetching it.)
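A rough sketch of that split, fetching in one step and parsing in another, so a failure points clearly at either the network or the parser. It is written for Python 3 using only the standard library (`urllib.request` in place of urllib2, and `html.parser` standing in for lxml); `fetch`, `parse_title` and `TitleParser` are made-up names for illustration, not anything from lxml.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, timeout=10):
    """Step 1: network only. Raises URLError/HTTPError if the page can't load."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_title(raw_bytes, encoding="utf-8"):
    """Step 2: parsing only, on bytes that are already in memory."""
    parser = TitleParser()
    # errors="replace" keeps a stray bad byte from aborting the whole parse.
    parser.feed(raw_bytes.decode(encoding, errors="replace"))
    parser.close()
    return parser.title.strip()
```

If `fetch(url)` succeeds and `parse_title(...)` fails, the problem is in the parsing or encoding step rather than in loading the page.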

# ? Mar 3, 2010 13:35

Lurchington: Jan 2, 2003; Forums Dragoon

Jonnty posted:

Also "Pyhton Interpreter"?

PyScripter is definitely my favorite windows IDE, but there's one or two idiosyncrasies there. That's what's printed with an external run specified (alt+F9)

And it certainly could be that the link isn't loading (in my original implementation, I did use urllib2 to get the url, which was successfully opened), but I was just trying to repeat Scaevolus's successful attempt with the same syntax.

I'll try the different parse option and the tounicode parts this afternoon, thanks for the suggestions.

# ? Mar 3, 2010 13:46

the wizards beard: Apr 15, 2007; Reppin

4 LIFE 4 REAL

e: nvm, this is serious overkill

the wizards beard fucked around with this message at 17:25 on Mar 3, 2010

# ? Mar 3, 2010 16:46

Jonnty: Aug 2, 2007; The enemy has become a flaming star!

Lurchington posted:

PyScripter is definitely my favorite windows IDE, but there's one or two idiosyncrasies there. That's what's printed with an external run specified (alt+F9)

And it certainly could be that the link isn't loading (in my original implementation, I did use urllib2 to get the url, which was successfully opened), but I was just trying to repeat Scaevolus's successful attempt with the same syntax.

I'll try the different parse option and the tounicode parts this afternoon, thanks for the suggestions.

For future reference, though, do make an attempt to understand error messages - don't just go 'welp, program guts are all over my screen, time to panic'. This:

code:

IOError: Error reading file 'http://www.usccb.org/nab/032110.shtml': failed to load external entity "http://www.usccb.org/nab/032110.shtml"

isn't horrendously cryptic.

# ? Mar 3, 2010 17:50

Lurchington: Jan 2, 2003; Forums Dragoon

The point I was making was that I used the same code as the previous poster, but mine errored. While lxml clearly couldn't open the link, and had a relatively clear error message, I don't think it's a stretch to consider that something else was the actual root cause.

If you simply wanted an opportunity to say "don't freak out and actually read error messages," fine, I'm right there with you.

Lurchington fucked around with this message at 18:12 on Mar 3, 2010

# ? Mar 3, 2010 18:10

Lurchington: Jan 2, 2003; Forums Dragoon

Alright, I'm fine to not talk about this anymore since I gave up on the original project and I don't want to derail the thread too bad, but here's where I'm at:

Scaevolus posted:

It works fine for me.

code:

>>> from lxml import html
>>> h = html.parse("http://www.usccb.org/nab/032110.shtml")
>>> print html.tostring(h.getroot())
<html><head><title>USCCB - (NAB) - March 21, 2010</title>

[[REDACTED]]

<!-- comment out no need for this as months are all over the left margin
        <div class="menutitle" onclick="SwitchMenu('sub3')">Readings and Psalms for the Month</a></div>
        <span class="submenu" id="sub3">

ok, I'm at a separate computer that has lxml 2.2.2 installed and got:

code:

Python 2.6.2 (r262:71600, Apr 21 2009, 15:05:37) [MSC v.1500 32 bit (Intel)] on win32
>>> from lxml import html
>>> h = html.parse("http://www.usccb.org/nab/032110.shtml")
>>> print html.tostring(h.getroot())

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    print html.tostring(h.getroot())
  File "C:\Python26\lib\site-packages\lxml\html\__init__.py", line 1426, in tostring
    encoding=encoding)
  File "lxml.etree.pyx", line 2630, in lxml.etree.tostring (src/lxml/lxml.etree.c:49093)
  File "serializer.pxi", line 124, in lxml.etree._tostring (src/lxml/lxml.etree.c:78704)
  File "serializer.pxi", line 149, in lxml.etree._raiseSerialisationError (src/lxml/lxml.etree.c:78963)
SerialisationError: IO_ENCODER
>>> 

I then uninstalled lxml 2.2.2 and installed 2.2.4 using lxml-2.2.4.win32-py2.6.exe and got the following two errors:

(original form used by Scaev)

code:

>>> from lxml import html
>>> h = html.parse("http://www.usccb.org/nab/032110.shtml")

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    h = html.parse("http://www.usccb.org/nab/032110.shtml")
  File "C:\Python26\lib\site-packages\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64060)
IOError: Error reading file 'http://www.usccb.org/nab/032110.shtml': failed to load external entity "http://www.usccb.org/nab/032110.shtml"

(and after putting urllib2.urlopen in the loop)

code:

>>> from lxml import html
>>> import urllib2
>>> h = html.parse(urllib2.urlopen("http://www.usccb.org/nab/032110.shtml"))
>>> print html.tostring(h.getroot())

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print html.tostring(h.getroot())
  File "C:\Python26\lib\site-packages\lxml\html\__init__.py", line 1442, in tostring
    encoding=encoding)
  File "lxml.etree.pyx", line 2624, in lxml.etree.tostring (src/lxml/lxml.etree.c:49097)
  File "serializer.pxi", line 124, in lxml.etree._tostring (src/lxml/lxml.etree.c:78863)
  File "serializer.pxi", line 149, in lxml.etree._raiseSerialisationError (src/lxml/lxml.etree.c:79122)
SerialisationError: IO_ENCODER

for fun, I did a read() on the urlopen object and it looks fine. I also tried lxml.html.fromstring(urllib2.urlopen(url).read()) and got the same SerialisationError.

and to answer tef: using

code:

import lxml
import urllib2
from lxml import html
h = html.parse(urllib2.urlopen("http://www.usccb.org/nab/032110.shtml"))
print lxml.etree.tounicode(h.getroot())

yielded:

code:

Traceback (most recent call last):
  File "C:\Python26\m1.py", line 5, in <module>
    print lxml.etree.tounicode(h.getroot())
  File "lxml.etree.pyx", line 2670, in lxml.etree.tounicode (src/lxml/lxml.etree.c:49402)
  File "serializer.pxi", line 128, in lxml.etree._tostring (src/lxml/lxml.etree.c:78896)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x95 in position 6303: unexpected code byte
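For what that last error means in isolation: byte 0x95 cannot start a character in UTF-8, but it is a bullet character in the Windows-1252 code page. The snippet below is a Python 3 sketch on a made-up fragment; that the page is really Windows-1252 is a guess, not something confirmed here.

```python
# Hypothetical fragment containing the byte the traceback complains about.
raw = b"<td>\x95</td>"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = exc  # exc.start is the offset of the offending byte

# Assumption: the bytes are Windows-1252, where 0x95 is U+2022 (bullet).
as_cp1252 = raw.decode("cp1252")

# Fallback when the real encoding is unknown: keep going, mark the bad byte.
lossy = raw.decode("utf-8", errors="replace")
```

Decoding with the right codec before handing text to a serializer avoids the error; `errors="replace"` trades correctness for not crashing.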

Lurchington fucked around with this message at 18:54 on Mar 3, 2010

# ? Mar 3, 2010 18:42

Stabby McDamage: Dec 11, 2005; Doctor Rope

Okay, here's something dead simple I can't find an elegant way to do.

I want to set a slice of a list to a single constant. My naive guess:
list[0:15] = 100
No dice. Is there a way to do this without a loop or constructing a list of the same constant repeated N times for the sole purpose of pairing to the list slice?

EDIT: In case it's not clear, I'm thinking of something analogous to memset() in C.

# ? Mar 3, 2010 19:30

MaberMK: Feb 1, 2008; BFFs

Stabby McDamage posted:

Okay, here's something dead simple I can't find an elegant way to do.

I want to set a slice of a list to a single constant. My naive guess:
list[0:15] = 100
No dice. Is there a way to do this without a loop or constructing a list of the same constant repeated N times for the sole purpose of pairing to the list slice?

EDIT: In case it's not clear, I'm thinking of something analogous to memset() in C.

code:

list[0:15] = [100] * 15

e: Actually, since you may want to generalize that to an arbitrary range, this would be more robust:

code:

list[x:y] = [value] * len(list[x:y])
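To see why the naive version fails and what the replicated-list form does, here is a small Python 3 sketch. The name `values` is made up, chosen to avoid shadowing the built-in `list`.

```python
values = list(range(20))

try:
    values[0:15] = 100           # a bare int is not iterable, so this raises
except TypeError as exc:
    scalar_error = str(exc)

# Slice assignment wants an iterable: replace 15 items with 15 copies.
values[0:15] = [100] * 15
```

The right-hand side is a real temporary list, so this trades some memory for brevity.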

MaberMK fucked around with this message at 19:43 on Mar 3, 2010

# ? Mar 3, 2010 19:39

No Safe Word: Feb 26, 2005

MaberMK posted:

code:

list[0:15] = [100] * 15

e: Actually, since you may want to generalize that to an arbitrary range, this would be more robust:

code:

list[x:y] = [value] * len(list[x:y])

Or just list[x:y] = [value] * (y-x) :v:
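One caveat worth sketching: `y - x` and `len(list[x:y])` only agree when both indices are non-negative and inside the list. If `y` runs past the end, the `(y-x)` form grows the list. `fill_slice` below is a hypothetical helper (Python 3) that counts the slots the slice really covers.

```python
def fill_slice(seq, start, stop, value):
    """Overwrite seq[start:stop] with value in place; len(seq) is unchanged."""
    # slice.indices() clips start/stop to the sequence, like slicing does.
    count = len(range(*slice(start, stop).indices(len(seq))))
    seq[start:stop] = [value] * count
    return seq


grown = [0] * 5
grown[2:10] = [9] * (10 - 2)     # slice covers 3 items, 8 are assigned

clipped = [0] * 5
fill_slice(clipped, 2, 10, 9)    # only the 3 existing slots are overwritten
```

When `x` and `y` are known to be valid indices, the shorter `(y-x)` form is fine.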

# ? Mar 3, 2010 20:58

MaberMK: Feb 1, 2008; BFFs

No Safe Word posted:

Or just list[x:y] = [value] * (y-x)

hurrr, color me retarded. Stabby, do it this way.

# ? Mar 3, 2010 22:22

Stabby McDamage: Dec 11, 2005; Doctor Rope

No Safe Word posted:

Or just list[x:y] = [value] * (y-x)

Isn't that going to create a scratch list that's y-x entries long? Or is there some magic there I don't see? It doesn't matter for what I was doing, but it seems really inefficient for large values of y-x. I ended up doing:

for i in range(x,y): a[i] = value

Anyway, here's my next question. Multiple constructors: how do I make them elegant?

Right now I have my __init__ be the root case that the user will never directly call. Then I have various @classmethod-decorated functions that call the constructor, then diddle with the new object before finally returning it. Is this the right way to do it?
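A minimal sketch of that pattern, with a made-up `Point` class: `__init__` takes the canonical arguments, and each `@classmethod` converts some other input into them. Going through `cls(...)` rather than naming the class means subclasses get instances of themselves.

```python
class Point:
    def __init__(self, x, y):
        # The one "root" constructor: canonical arguments only.
        self.x = x
        self.y = y

    @classmethod
    def from_tuple(cls, pair):
        x, y = pair
        return cls(x, y)

    @classmethod
    def from_string(cls, text):
        # Accepts e.g. "1.5, 2"; float() tolerates surrounding whitespace.
        x, y = (float(part) for part in text.split(","))
        return cls(x, y)

    @classmethod
    def origin(cls):
        return cls(0, 0)


class Point3(Point):
    pass


p = Point.from_string("1.5, 2")
q = Point3.origin()              # a Point3, not a plain Point
```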

# ? Mar 4, 2010 00:01

Avenging Dentist: Oct 1, 2005; oh my god is that a circular saw that does not go in my mouth aaaaagh

Stabby McDamage posted:

Isn't that going to create a scratch list that's y-x entries long? Or is there some magic there I don't see? It doesn't matter for what I was doing, but it seems really inefficient for large values of y-x.

If you're worried about efficiency, why are you using a list?

# ? Mar 4, 2010 00:10

Stabby McDamage: Dec 11, 2005; Doctor Rope

I just tested it, and it does make a y-x list and then throw it away. What's worse is that my for loop uses even more memory and runs even slower! I think the range() in my for loop is literally making a list in memory -- I thought it was supposed to be a generator?

To reproduce:

code:

>>> n = 100000000
>>> a = [1] * n
(memory usage goes to 384MB)
>>> m = 90000000
>>> a[0:m] = [2]*m
(memory briefly blips up to 752MB, then falls back to 384MB)
>>> for i in range(0,m): a[i]=2
(memory goes through the roof and starts thrashing until I interrupt it)

Python 2.5.2, for the record.

# ? Mar 4, 2010 00:12

Stabby McDamage: Dec 11, 2005; Doctor Rope

Avenging Dentist posted:

If you're worried about efficiency, why are you using a list?

What else other than a list would I use for array-like functionality in Python?

Or are you asking "if you care about large arrays, why are you writing Python?"

# ? Mar 4, 2010 00:13

Avenging Dentist: Oct 1, 2005; oh my god is that a circular saw that does not go in my mouth aaaaagh

Stabby McDamage posted:

What else other than a list would I use for array-like functionality in Python?

NumPy. Which incidentally does what you tried to do in the beginning.

Also in Python 2.x, range returns a list. You want xrange.
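For anyone reading this on Python 3: there `range` is already lazy like 2.x's `xrange`, and `xrange` itself is gone. A quick sketch (Python 3 assumed, variable names made up) showing that a huge `range` costs almost nothing until you iterate it:

```python
import sys

lazy = range(90_000_000)                 # no 90-million-item list is built
lazy_size = sys.getsizeof(lazy)          # size of the small range object

small_list = list(range(1000))           # forcing a list allocates every slot
small_list_size = sys.getsizeof(small_list)

# The loop from earlier in the thread, on a small list:
a = [1] * 10
for i in range(2, 6):
    a[i] = 2
```

On 2.x the same loop would be written with `xrange(2, 6)` to avoid the temporary list.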

# ? Mar 4, 2010 00:24

taqueso: Mar 8, 2004

Won't replacing elements in a loop cause python to traverse the list y-x times, pretty much negating any benefit of not creating a scratch list?

# ? Mar 4, 2010 00:33

Avenging Dentist: Oct 1, 2005; oh my god is that a circular saw that does not go in my mouth aaaaagh

taqueso posted:

Won't replacing elements in a loop cause python to traverse the list y-x times, pretty much negating any benefit of not creating a scratch list?

What?

# ? Mar 4, 2010 00:44

taqueso: Mar 8, 2004

Avenging Dentist posted:

What?

Maybe cpython optimizes this, but I would think

code:

for i in range(x,y): a[i] = value

will result in:
1. traverse the list to get to element a[x]
2. replace a[x] with value
3. increment x and go to 1

But

code:

list[x:y] = [value] * (y-x)

will only need to traverse the list once.

I am not a python expert by any stretch of the imagination and I would love to find out that python is smart enough to remember the previous element in the for loop.

# ? Mar 4, 2010 00:49

Avenging Dentist: Oct 1, 2005; oh my god is that a circular saw that does not go in my mouth aaaaagh

What are you talking about? Why would Python need to traverse a list to get to an integer offset in an array?

# ? Mar 4, 2010 00:50

taqueso: Mar 8, 2004

Avenging Dentist posted:

What are you talking about? Why would Python need to traverse a list to get to an integer offset in an array?

You are right. For some reason I thought that a linked list is used to represent the list datatype, but as you say it is an array and an element can be found in constant time.

# ? Mar 4, 2010 00:54

Janitor Prime: Jan 22, 2004; PC LOAD LETTER

What da fuck does that mean; Fun Shoe

Why the hell would anyone implement the [] operator if the underlying implementation was a linked list! :psyduck:

# ? Mar 4, 2010 01:09

Scaevolus: Apr 16, 2007

Lurchington posted:

Alright, I'm fine to not talk about this anymore since I gave up on the original project and I don't want to derail the thread too bad, but here's where I'm at:

code:

from lxml import html
h = html.parse("http://www.usccb.org/nab/032110.shtml")
print html.tostring(h, encoding='utf-8')

# ? Mar 4, 2010 01:41

tripwire: Nov 19, 2004; _{ghost flow}

Lurchington, I think whatever activepython or pyshell fanciness you are using is making this harder for you by obfuscating the error messages a little.

I assume that what should be UnicodeDecodeError exceptions are instead getting translated into those serialization errors by whatever pyshell or activepython stuff you are using.

It's very important to keep in mind that a windows console will ALWAYS give you unicode encoding errors if you try to print out unicode characters on it, because the native character encoding of the windows command interpreter is ASCII!

That's why you are getting that error doing what Tef suggested.
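A small sketch of one way to keep debug printing from blowing up on a limited console: squeeze the text through the console's own encoding with `errors="replace"` before printing. `console_safe` is a hypothetical helper (Python 3); it works the same whether the console is plain ASCII or a legacy code page, since unencodable characters simply come out as `?`.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with anything the target encoding can't represent replaced by '?'."""
    # Fall back to ASCII if stdout does not report an encoding.
    target = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(target, errors="replace").decode(target)


print(console_safe("caf\u00e9 \u2022 done", "ascii"))
```

This only affects what gets displayed; the original string is left untouched for the real processing.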

# ? Mar 4, 2010 02:35

ErIog: Jul 11, 2001

tripwire posted:

Lurchington, I think whatever activepython or pyshell fanciness you are using is making this harder for you by obfuscating the error messages a little.

I assume that what should be UnicodeDecodeError exceptions are instead getting translated into those serialization errors by whatever pyshell or activepython stuff you are using.

It's very important to keep in mind that a windows console will ALWAYS give you unicode encoding errors if you try to print out unicode characters on it, because the native character encoding of the windows command interpreter is ASCII!

That's why you are getting that error doing what Tef suggested.

I will attest to this. Unicode always makes everything more complicated, but it's all soluble if you remember this when dealing with Unicode. I'm on a project now that requires parsing a bunch of Unicode XML, and I've had more trouble with my debug reporting than the real problem solving.

# ? Mar 4, 2010 02:50

Thermopyle: Jul 1, 2003; ...the stupid are cocksure while the intelligent are full of doubt. - Bertrand Russell

ErIog posted:

I will attest to this. Unicode always makes everything more complicated, but it's all soluble if you remember this when dealing with Unicode. I'm on a project now that requires parsing a bunch of Unicode XML, and I've had more trouble with my debug reporting than the real problem solving.

Agreed. I'm writing some stuff now that takes a bunch of unicode from web services and debugging on windows is bullshit because of the ASCII console.

# ? Mar 4, 2010 03:14

Lurchington: Jan 2, 2003; Forums Dragoon

tripwire posted:

Lurchington, I think whatever activepython or pyshell fanciness you are using is making this harder for you by obfuscating the error messages a little.

That's likely, I have a Mac and Linux test box around here that I can use, but all my windows machines are using ActivePython.

Scaevolus posted:

code:

from lxml import html
h = html.parse("http://www.usccb.org/nab/032110.shtml")
print html.tostring(h, encoding='utf-8')

Just using the url gives "failed to load external entity", but using urllib2.urlopen on the url with the explicit encoding does seem to provide good results.

Upon inspection, there's like 80 of these lines:

code:

<td> </td>

Thanks everyone, but this was frustrating because I had thought I had already learned my tough unicode lessons thanks to a lot of work with XML/ElementTree stuff. I guess my takeaway here may be more on the Python middleware side of things.

# ? Mar 4, 2010 03:50

Stabby McDamage: Dec 11, 2005; Doctor Rope

Avenging Dentist posted:

NumPy. Which incidentally does what you tried to do in the beginning.

Also in Python 2.x, range returns a list. You want xrange.

I'll keep that in mind for the future. I assume Numpy is written in C on the backend?

Regarding xrange(): why does range() exist given the existence of xrange()? Are there situations where the generator nature of it would be a problem?

Also, there's something else I'm curious about. Why did my for loop take so much more memory than the list itself? It seems like range(0,90e6) should be a bit smaller than [1] * 100e6, but instead my for loop ate all the RAM available (~1.5GB) and started thrashing. Weird.

# ? Mar 4, 2010 05:40

Avenging Dentist: Oct 1, 2005; oh my god is that a circular saw that does not go in my mouth aaaaagh

Stabby McDamage posted:

Also, there's something else I'm curious about. Why did my for loop take so much more memory than the list itself? It seems like range(0,90e6) should be a bit smaller than [1] * 100e6, but instead my for loop ate all the RAM available (~1.5GB) and started thrashing. Weird.

Because [x]*N creates a list with N references to x (i.e. for L = [x]*N, L[i] is L[j] for all i,j in [0,N)), whereas range(N) creates a list with N distinct integers, all* of which are heap-allocated as separate objects. It should be obvious why the former takes less space than the latter.

Basically a Python list is an array of pointers to PyObjects, so for an N-element list on a 32-bit system, there are 4*N bytes of data taken up by that (plus 12 bytes for the refcount, pointer to type object, and length). Each integer object in Python takes 12 bytes of data (refcount, pointer to type object, and value), so a list of N distinct Python ints costs 16*N+12 bytes of data, excluding malloc overhead. A list of N identical Python ints costs 4*N+24 bytes.

* However, integers in the range [-5, 256] are cached by Python and don't involve an allocation.
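Plugging the thread's numbers into those formulas, as a sketch: the byte sizes are the 32-bit CPython 2.x figures from this post (they differ on 64-bit builds and on Python 3), and the function names are made up.

```python
POINTER = 4        # bytes per list slot on a 32-bit build (from the post)
LIST_HEADER = 12   # refcount + type pointer + length
INT_OBJECT = 12    # refcount + type pointer + value


def distinct_ints_bytes(n):
    """List of n separately allocated ints: 16*n + 12."""
    return LIST_HEADER + n * (POINTER + INT_OBJECT)


def shared_ref_bytes(n):
    """List of n references to one int: 4*n + 24."""
    return LIST_HEADER + n * POINTER + INT_OBJECT


range_list = distinct_ints_bytes(90_000_000)    # range(0, m) on 2.x
repeated = shared_ref_bytes(100_000_000)        # [1] * n

# The shared-reference claim itself, checked with a plain object:
marker = object()
shared = [marker] * 1000
all_same = all(item is marker for item in shared)
```

That gives about 1.34 GiB for the `range` list against about 381 MiB for `[1] * n`, which lines up with the 384MB reading and with the loop exhausting roughly 1.5GB of RAM.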

Avenging Dentist fucked around with this message at 06:44 on Mar 4, 2010

# ? Mar 4, 2010 06:38

nbv4: Aug 21, 2002; by Duchess Gummybuns

Stabby McDamage posted:

Right now I have my __init__ be the root case that the user will never directly call. Then I have various @classmethod-decorated functions that call the constructor, then diddle with the new object before finally returning it. Is this the right way to do it?

I'm not a python expert, but that's exactly how I do it.

# ? Mar 4, 2010 07:07

Stabby McDamage: Dec 11, 2005; Doctor Rope

Avenging Dentist posted:

Because [x]*N creates a list with N references to x (i.e. for L = [x]*N, L[i] is L[j] for all i,j in [0,N)), whereas range(N) creates a list with N distinct integers, all* of which are heap-allocated as separate objects. It should be obvious why the former takes less space than the latter.

Basically a Python list is an array of pointers to PyObjects, so for an N-element list on a 32-bit system, there are 4*N bytes of data taken up by that (plus 12 bytes for the refcount, pointer to type object, and length). Each integer object in Python takes 12 bytes of data (refcount, pointer to type object, and value), so a list of N distinct Python ints costs 16*N+12 bytes of data, excluding malloc overhead. A list of N identical Python ints costs 4*N+24 bytes.

* However, integers in the range [-5, 256] are cached by Python and don't involve an allocation.

Individual integers are heap-allocated full python objects, with a refcount and everything? Gross. I mean, I see why you'd want to implement it that way, but drat.

# ? Mar 4, 2010 15:32

BigRedDot: Mar 6, 2008

Stabby McDamage posted:

I assume Numpy is written in C on the backend?

Yes, it is (I used to work with the guy who wrote it).

# ? Mar 4, 2010 17:17

Scaevolus: Apr 16, 2007

I ran into this bug yesterday.

code:

class LeakyDict(dict):
    def __init__(self):
        dict.__init__(self)
        self.__dict__ = self


while True:
    a = LeakyDict()

I'm annoyed that they've had a patch to fix it for 3+ years and haven't committed it yet.
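If the goal of `self.__dict__ = self` is just attribute-style access to the keys, one way to get that without making the instance refer to itself is to forward attribute lookups to the items. `AttrDict` is a made-up name and this is only a sketch of that alternative (Python 3), not the patch mentioned above.

```python
class AttrDict(dict):
    """dict whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods
        # such as .keys() still win over a key named "keys".
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value

    def __delattr__(self, name):
        try:
            del self[name]
        except KeyError:
            raise AttributeError(name)


d = AttrDict()
d.x = 1            # stored as d["x"]
d["y"] = 2         # readable as d.y
```

It is a little slower than the `__dict__` trick because every attribute access goes through a method call.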

Scaevolus fucked around with this message at 00:54 on Mar 6, 2010

# ? Mar 6, 2010 00:34
