leterip
Aug 25, 2004

tef posted:

javascript/json only has double precision floating point

That's just JavaScript. JSON as a spec allows arbitrarily sized integers/decimals.


tef
May 30, 2004

-> some l-system crap ->

leterip posted:

That's just JavaScript. JSON as a spec allows arbitrarily sized integers/decimals.

JavaScript and most parsers that I've encountered. JSON is designed to be a subset of JavaScript.

That is why we have things like this

quote:

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".

Cat Plus Plus
Apr 8, 2011

:frogc00l:

The King of Swag posted:

I've seen that before and the problem is that I need to save my data out as binary and be able to easily read it all back.

I second Protobuf for this. It makes forward/backward compatibility a lot easier, and sooner or later you will change the format and will have to load older files. Plus, it's language-agnostic and portable.

JSON is fine if you don't want a binary file format.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Python's JSON parser distinguishes between ints and floats based on whether the number literal contains a decimal point or exponent.

http://hg.python.org/cpython/file/5395f96588d4/Lib/json/scanner.py#l48
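(A quick illustration of that behaviour with the stdlib json module; the exponent case goes through parse_float as well.)

Python code:
import json

print(type(json.loads('2')))     # int
print(type(json.loads('2.0')))   # float -- a decimal point routes it through parse_float
print(type(json.loads('2e3')))   # float -- so does an exponent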

tef
May 30, 2004

-> some l-system crap ->
It also handles NaN and +/- Infinity, which are excluded from the spec:

quote:

Numeric values that cannot be represented as sequences of digits
(such as Infinity and NaN) are not permitted.
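(Both halves of that are easy to see from the REPL; allow_nan=False is the stdlib's way of opting into strictness.)

Python code:
import json

print(json.loads('[NaN, Infinity, -Infinity]'))  # [nan, inf, -inf], accepted by default
print(json.dumps(float('inf')))                  # 'Infinity', which is not legal JSON per the RFC
# json.dumps(float('inf'), allow_nan=False) would raise ValueError if you want spec-compliant output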

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Sure, but that says nothing about the validity of the way Python's JSON scanner interprets number literals.

There's nothing explicit, in the JSON RFC or otherwise, saying that all numbers should be within the representable range of IEEE 754 double-precision floats.

tef
May 30, 2004

-> some l-system crap ->
I just don't like JSON :argh:

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Sure. You don't have to like it. That's your opinion, and I respect it.

But don't make up incorrect facts that are verifiably false. That's rude.

tef
May 30, 2004

-> some l-system crap ->
Sorry, I just assumed that the design criteria of JSON, being a subset of Javascript, implied that the numbers were doubles, and the strings didn't handle things outside the BMP. The latter is explicit, the former is just my faulty recollection :3:

Also, you smell.

ufarn
May 30, 2009
How would I go about writing a family tree in Python, and what charting library should I use for it?

Cat Plus Plus
Apr 8, 2011

:frogc00l:

ufarn posted:

How would I go about writing a family tree in Python, and what charting library should I use for it?

I'd pick a graph library (like igraph or NetworkX); they usually come with drawing support, or at least Graphviz export.
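(A rough sketch of the NetworkX route. The names and the tiny tree are made up, the drawing needs matplotlib installed, and the Graphviz export needs pydot; exact function locations vary a little between NetworkX versions.)

Python code:
import networkx as nx
import matplotlib.pyplot as plt

# hypothetical family tree: edges point from parent to child
tree = nx.DiGraph()
tree.add_edges_from([
    ('Grandma', 'Mum'), ('Grandpa', 'Mum'),
    ('Mum', 'Me'), ('Dad', 'Me'),
])

nx.draw(tree, with_labels=True)  # quick-and-dirty matplotlib drawing
plt.show()

# or hand it to Graphviz for a proper layered layout
nx.drawing.nx_pydot.write_dot(tree, 'family.dot')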

Sub Par
Jul 18, 2001


Dinosaur Gum
I have what may be a dumb question. I have a python script that is a small command line app. It's a proof of concept for what may end up being a more fully-featured command-line app. I need to demonstrate this to someone who is not technically savvy enough to install python and whatnot, and I can't just email them an exe.

I don't know much about web stuff, but is there some website where I can upload the .py and have it run in a browser, so I can send this person to e.g. http://myscript.idleonline.com or something?

Cat Plus Plus
Apr 8, 2011

:frogc00l:
Ideone can run both Python 2 and 3 (and more). It's really handy.

Sub Par
Jul 18, 2001


Dinosaur Gum
That is handy, thanks. But I need it to be interactive. It's along the lines of:
code:
while True:
    user_input = raw_input('Type Stuff -> ')
    if user_input in ('exit', 'quit'): break
    print do_awesome_stuff(user_input)

Cat Plus Plus
Apr 8, 2011

:frogc00l:
You can put the input sequence in the "stdin" field, and it'll work.
Alternatively, you can try repl.it, which actually runs the interpreter compiled to JavaScript inside the browser.

evilentity
Jun 25, 2010
Or you could send the exe :v:

http://www.pyinstaller.org/

Works for me.

OnceIWasAnOstrich
Jul 22, 2006

I am writing a webapp (using CherryPy) for visualization of some data sets. This isn't going to be on the internet, just run on the user's computer. Currently I am loading the data (1-6+ integer lists, 3M+ elements long) into memory. The client POSTs and CherryPy responds with a JSON dump of the slice of the datasets that was requested. This works great because I only have one dataset and an obscene amount of RAM, both things that users may lack.

I'm trying to think of the most appropriate way to access the data as needed without loading everything into memory. I thought about SQLite, but SQL doesn't seem to have a way to easily deal with array-like data. Any suggestions?

Sub Par
Jul 18, 2001


Dinosaur Gum

evilentity posted:

Or you could send the exe :v:

http://www.pyinstaller.org/

Works for me.

I considered that, but py2exe was building 73MB of poo poo that needed to come along with the exe so I gave up. I ended up doing a GoToMeeting, which worked fine. But I'll check out PyInstaller. Thanks.

Emacs Headroom
Aug 2, 2003

OnceIWasAnOstrich posted:

I'm trying to think of the most appropriate way to access the data as needed without loading everything into memory. I thought about SQLite, but SQL doesn't seem to have a way to easily deal with array-like data. Any suggestions?

You could use HDF5. If you're not building something you intend to generalize to other things, though, you could also just split up your arrays into a series of separate files (like pickled numpy arrays, each containing 100k elements or something) and load them as needed?
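(A minimal sketch of the split-into-files idea, with made-up names and chunk size; numpy.save/.load and .npy files are used here instead of pickle, but the principle is the same.)

Python code:
import numpy as np

CHUNK = 100000  # elements per file

def save_chunks(arr, prefix):
    # write the big array out as a series of small .npy files
    for i in range(0, len(arr), CHUNK):
        np.save('%s_%06d.npy' % (prefix, i // CHUNK), arr[i:i + CHUNK])

def load_slice(prefix, start, stop):
    # read back only the files that overlap the requested [start, stop) range
    parts = []
    for idx in range(start // CHUNK, (stop - 1) // CHUNK + 1):
        part = np.load('%s_%06d.npy' % (prefix, idx))
        lo = max(start - idx * CHUNK, 0)
        hi = min(stop - idx * CHUNK, len(part))
        parts.append(part[lo:hi])
    return np.concatenate(parts)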

Cat Plus Plus
Apr 8, 2011

:frogc00l:

OnceIWasAnOstrich posted:

I'm trying to think of the most appropriate way to access the data as needed without loading everything into memory. I thought about SQLite, but SQL doesn't seem to have a way to easily deal with array-like data. Any suggestions?

PyTables might help.
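(Something along these lines; the file and node names are made up, and note that the PyTables releases current at the time spelled these openFile/createArray rather than the open_file/create_array used below.)

Python code:
import numpy as np
import tables

data = np.random.randint(0, 100, size=3000000)

# write the dataset into an HDF5 file once...
h5 = tables.open_file('datasets.h5', mode='w')
h5.create_array(h5.root, 'series1', data)
h5.close()

# ...then serve each request by slicing straight from disk,
# without ever holding the whole array in memory
h5 = tables.open_file('datasets.h5', mode='r')
chunk = h5.root.series1[10000:20000]
h5.close()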

OnceIWasAnOstrich
Jul 22, 2006


PyTables for HDF5 is probably what I will use, although what I've gathered from reading up on all this is that I should really be using NumPy arrays instead of plain CPython lists. I don't know why, but I have some sort of weird bias against NumPy.

Chimp_On_Stilts
Aug 31, 2004
Holy Hell.
I'm trying to start using Tkinter and I'm off to a bad start. I can't import it.

Here's the error I get:

code:
Traceback (most recent call last):
  File "Foo.py", line 5, in <module>
    from Tkinter import *
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk/Tkinter.py", line 39, in <module>
    import _tkinter # If this fails your Python may not be configured for Tk
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_tkinter.so, 2): no suitable image found.  Did find:
	/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_tkinter.so: no matching architecture in universal wrapper
I'm running MacOS 10.7. I've spent some time searching, but I haven't puzzled out a solution.

EDIT: I was running Python 2.7; upgrading to 2.7.3 fixed it, if anybody wanted to know.

Chimp_On_Stilts fucked around with this message at 03:18 on Aug 19, 2012

OnceIWasAnOstrich
Jul 22, 2006

OnceIWasAnOstrich posted:

PyTables for HDF5 is probably what I will use, although what I've gathered from reading up on all this is that I should really be using NumPy arrays instead of plain CPython lists. I don't know why, but I have some sort of weird bias against NumPy.

While HDF5 and PyTables are actually pretty amazing for what I'm doing (and save a shitton of memory and improve speed for a number of internal parts), there is no way in hell that most of the people who will want to use my program will be able to get HDF5 + PyTables installed correctly, especially on OSX.

On another note, I have discovered another reason for my dislike of numpy.

Python code:
In [155]: l=[randint(0,100) for _ in xrange(200)]

In [156]: a=numpy.array(l)

In [157]: timeit sum(l)
100000 loops, best of 3: 2.3 us per loop

In [158]: timeit sum(a)
10000 loops, best of 3: 111 us per loop

In [159]: timeit numpy.sum(a)
100000 loops, best of 3: 4.35 us per loop
Apparently CPython's optimization of list summation is better than numpy's optimized summation of arrays. When you have a script where ~80% of the execution time is tied up in integer list summation...welp.

Emacs Headroom
Aug 2, 2003

OnceIWasAnOstrich posted:

Apparently CPython's optimization of list summation is better than numpy's optimized summation of arrays. When you have a script where ~80% of the execution time is tied up in integer list summation...welp.
Uh, that's pretty weird. Is your numpy install hosed somehow?

This is Python 2.7.3 installed through Macports:

Python code:
In [1]: l=[randint(0,100) for _ in xrange(200)]

In [2]: a=numpy.array(l)

In [3]: %timeit a.sum()
100000 loops, best of 3: 2.27 us per loop

In [4]: %timeit sum(a)
100000 loops, best of 3: 3.76 us per loop

In [5]: %timeit sum(l)
10000 loops, best of 3: 44.3 us per loop

In [6]: %timeit numpy.sum(l)
10000 loops, best of 3: 44.6 us per loop

Cat Plus Plus
Apr 8, 2011

:frogc00l:
Apparently a.sum() and numpy.sum(a) aren't exactly the same for some reason. :iiam:
But yeah, don't use the built-in sum for NumPy arrays; you'll benefit from them on large datasets anyway (because NumPy arrays are unboxed, so at the very least you'll be more cache-friendly).

For 200 elements I get (Py 2.7, Win7 x64, NP 1.6.1):

code:
In [7]: timeit sum(l)
100000 loops, best of 3: 2.53 us per loop

In [8]: timeit sum(a)
10000 loops, best of 3: 97.4 us per loop

In [9]: timeit a.sum()
1000000 loops, best of 3: 1.93 us per loop

In [10]: timeit numpy.sum(a)
100000 loops, best of 3: 2.82 us per loop
For 1000000 (and the difference between a.sum() and numpy.sum(a) is not visible; I don't know what that is about):

code:
In [23]: timeit sum(a)
1 loops, best of 3: 474 ms per loop

In [24]: timeit sum(l)
100 loops, best of 3: 11.9 ms per loop

In [25]: timeit a.sum()
1000 loops, best of 3: 1.38 ms per loop

In [26]: timeit numpy.sum(a)
1000 loops, best of 3: 1.38 ms per loop

OnceIWasAnOstrich
Jul 22, 2006

Emacs Headroom posted:

Uh, that's pretty weird. Is your numpy install hosed somehow?


This is 100% possible and something I should probably investigate, although I have been way too busy to bother. I have a lot of trouble compiling any Python packages with gcc on 10.7. I probably horribly broke something somewhere since I was new to both OSX and Python.

Trying to install anything that needs to use gcc with distribute gives me this sort of error:

code:
lipo: can't figure out the architecture type of: /var/tmp//cch2WPt0.out

error: command 'gcc-4.2' failed with exit status 1
Anything non-Python related works great and compiles just fine with whatever the Xcode 4 compiler is. The only thing I have found that fixes this is to replace the gcc-4.2 executable with the gcc-4.0 executable from Xcode 3, then switch it back after I'm done. I imagine this could possibly break numpy.

I should probably just nuke my OS and start over.

Sockser
Jun 28, 2007

This world only remembers the results!




I must be retarded or crazy. I've installed Python too many times on my machine and it's totally hosed me I think. Don't know if this is a Unix problem or a Python problem or both.

Set up Pygame on OSX Mountain Lion and everything. Friend started a code project; I grab his code. Guess I have all my modules installed to /usr/bin/python2.7. Cool.

So if I go into a terminal and type which python2.7 I get "/usr/bin/python2.7"

if I call python2.7 main.py I get an error about a missing module
if I call /usr/bin/python2.7 main.py it works just fine

do I have to gently caress with my bash profile or something?
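(One way to see what actually differs between the two invocations, purely as a hypothetical diagnostic:)

Python code:
# save as whichpython.py, then run it both ways:
#   python2.7 whichpython.py
#   /usr/bin/python2.7 whichpython.py
import sys

print(sys.executable)        # which binary actually ran
print('\n'.join(sys.path))   # and where it looks for modules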

vikingstrike
Sep 23, 2007

whats happening, captain
Can you post the error? Have you hosed with your PYTHONPATH recently?

Sockser
Jun 28, 2007

This world only remembers the results!




welp

added
export PYTHONPATH=~/Library/Frameworks:$PYTHONPATH
to my .bash_profile

now it works. Cool.

I really hate OSX changing directory structure every release.

Emacs Headroom
Aug 2, 2003
I couldn't imagine wrestling with the system Python in OSX. Macports might not do everything right, but it makes Python a lot less painful.

vikingstrike
Sep 23, 2007

whats happening, captain

Sockser posted:

welp

added
export PYTHONPATH=~/Library/Frameworks:$PYTHONPATH
to my .bash_profile

now it works. Cool.

I really hate OSX changing directory structure every release.

It's annoying when you don't know about it and they change it suddenly, but I think moving that folder into the user's home folder is a smart move.

Although, like Emacs Headroom said, using Macports or Homebrew is usually a lot less painful.

Modern Pragmatist
Aug 20, 2008
So I'm back with more python3 str/bytes stuff.

The basic problem that I'm trying to solve is this. I read some data in from a binary file (comes in as a bytestring); however, at the time of reading, I don't actually know the proper encoding. By default the files use 'iso8859', but if it's anything other than that, it is specified at the end of the file (annoying, I know). So basically I've tried to create a subclass of str that essentially stores the bytestring and then allows us to decode it later with the specified encoding if there is one.

The user only has to deal with unicode and never accesses this class directly. Instead, another part of my program creates a new String instance and can specify the encoding determined after reading so that it can be stored and the unicode can be properly encoded when I need to write it back to the file.

I hope that makes sense.

I've tried to use the unicode sandwich as much as possible so that the user only deals in unicode and the only time bytestrings are necessary is when reading/writing the data to a file.

If you could take a look at my code, I would really appreciate it. I may be over-complicating things a bit.

http://pastie.org/4562873

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Seek to the end of the file, read the encoding first?

Captain Capacitor
Jan 21, 2008

The code you say?
What Suspicious Dish said, or use something like mmap.

Scaevolus
Apr 16, 2007

When reading a file: Read the entire file into a bytes object. Read the encoding from the end, then return the file (minus encoding) decoded with that encoding.

When writing a file: Encode the data (unicode) using your specified encoding, yielding bytes. Append the encoding. Write it to the file.
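(A minimal sketch of that round trip. The trailer layout here, the encoding name after a final newline, is invented for illustration; the real files described in this thread actually bury the encoding inside one of the data fields.)

Python code:
def read_text(path):
    # read everything, peel the encoding name off the end, decode the rest
    with open(path, 'rb') as f:
        raw = f.read()
    body, encoding = raw.rsplit(b'\n', 1)
    return body.decode(encoding.decode('ascii'))

def write_text(path, text, encoding):
    # encode the data, then append the encoding name so it can be recovered later
    with open(path, 'wb') as f:
        f.write(text.encode(encoding))
        f.write(b'\n' + encoding.encode('ascii'))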

Dren
Jan 5, 2001

Pillbug

Modern Pragmatist posted:

So I'm back with more python3 str/bytes stuff.

The basic problem that I'm trying to solve is this. I read some data in from a binary file (comes in as a bytestring); however, at the time of reading, I don't actually know the proper encoding. By default the files use 'iso8859', but if it's anything other than that, it is specified at the end of the file (annoying, I know). So basically I've tried to create a subclass of str that essentially stores the bytestring and then allows us to decode it later with the specified encoding if there is one.

The user only has to deal with unicode and never accesses this class directly. Instead, another part of my program creates a new String instance and can specify the encoding determined after reading so that it can be stored and the unicode can be properly encoded when I need to write it back to the file.

I hope that makes sense.

I've tried to use the unicode sandwich as much as possible so that the user only deals in unicode and the only time bytestrings are necessary is when reading/writing the data to a file.

If you could take a look at my code, I would really appreciate it. I may be over-complicating things a bit.

http://pastie.org/4562873

Is the data ever stored as Unicode using the BOM, which must be at the beginning of the file?

What about large files? If there is a 2 GB file, you're going to suck up a ton of memory to read it in as a bytestring and then create a decoded copy of it.

I haven't used Python 3 but I read some documentation on it:
the unicode howto - http://docs.python.org/release/3.0.1/howto/unicode.html
the open function - http://docs.python.org/release/3.0.1/library/functions.html#open

Python 3's open function allows you to specify an encoding for reading. If you were to specify the proper encoding when you opened the file, you would never have to worry about bytestrings, encoding, or decoding. You would also not be required to load the entire contents of the file into memory in order to properly decode them (which means you could work with large files).

I recommend that you seek to the end of the file where the encoding specification is, read the encoding, then re-open the file in the proper encoding. You can/should get rid of your intermediate string class. Python 3's open method will do all of the work for you.
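(That is, once the encoding is known, something this simple streams decoded text without slurping the whole file; the filename is made up.)

Python code:
# Python 3: open() hands you str (unicode) directly once you pass an encoding
with open('datafile.txt', encoding='iso8859-1') as f:
    for line in f:
        print(line)  # already decoded, no manual bytes handling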

Cat Plus Plus
Apr 8, 2011

:frogc00l:

Dren posted:

Python 3's open function allows you to specify an encoding for reading. If you were to specify the proper encoding when you opened the file, you would never have to worry about bytestrings, encoding, or decoding. You would also not be required to load the entire contents of the file into memory in order to properly decode them (which means you could work with large files).

It's there in Python 2, too: codecs.open or io.open (but that's only usable in 2.7; 2.6 had a lovely implementation of io).
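(Same idea on 2.7, again with a made-up filename:)

Python code:
import io

# io.open in Python 2.7 behaves like Python 3's open: it returns unicode
with io.open('datafile.txt', encoding='iso8859-1') as f:
    text = f.read()  # a unicode object, not a bytestring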

Modern Pragmatist
Aug 20, 2008

Dren posted:

Is the data ever stored as Unicode using the BOM, which must be at the beginning of the file?

What about large files? If there is a 2 GB file, you're going to suck up a ton of memory to read it in as a bytestring and then create a decoded copy of it.

I haven't used Python 3 but I read some documentation on it:
the unicode howto - http://docs.python.org/release/3.0.1/howto/unicode.html
the open function - http://docs.python.org/release/3.0.1/library/functions.html#open

Python 3's open function allows you to specify an encoding for reading. If you were to specify the proper encoding when you opened the file, you would never have to worry about bytestrings, encoding, or decoding. You would also not be required to load the entire contents of the file into memory in order to properly decode them (which means you could work with large files).

I recommend that you seek to the end of the file where the encoding specification is, read the encoding, then re-open the file in the proper encoding. You can/should get rid of your intermediate string class. Python 3's open method will do all of the work for you.

So the files I'm dealing with are a little complicated. Basically, you have the following binary format:

<ID><Length of Data><Datatype><Data><ID><Length of Data><Datatype><Data>...

And this repeats for ~100 data fields. The encoding is actually contained in one of these fields. Also, only 4 of the 15 or so possible Datatypes are encoded using the specified encoding. The others are encoded using iso8859. Because of this mix of encodings, I don't think setting the encoding for open would work out.

The field that contains the encoding to be used is approximately the 25th sequential field in the file but isn't consistent (not easy to seek for). That's why my initial thought was to store all the bytestrings and then decode them all after I've read in all fields (including the encoding). It wouldn't be so bad if the encoding field always occurred before any fields that required it, but that is rarely the case.

I can try to do something similar to what Suspicious Dish suggested but I'll have to try to figure out how to work that into the existing code. As it is, most fields are oblivious of the other fields, so it's difficult for each field to check for a specific encoding during file reading.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Read as bytestrings, and decode after you read something. Don't create a subclass of str, that's a really terrible idea.


Dren
Jan 5, 2001

Pillbug
OK, so your files have a record format:

<ID><Length of Data><Datatype><Data>

Each file consists of a collection of these records. How many datatypes are there? It sounds like some are metadata, for instance the record that tells the encoding of other types. I imagine that some of the datatypes have data that is of a known encoding or is an integer or something that doesn't need to be decoded. Furthermore, I expect that the ID, length, and datatype fields are fixed width fields.

I suggest that you create a class to model the record. Something like this, apologies if I'm using Python 2.x style:
code:
import struct
import codecs

class Record:
  ValidTypes = {
    1 : 'ENCODING',
    2 : 'ISO8859_FIELD',
    3 : 'UNKNOWN_ENC_FIELD',
  }

  @classmethod
  def read(cls, fd):
    buff = fd.read(12)  # read 12 bytes, I'm assuming <ID><Length><Datatype> takes 12 bytes
    if len(buff) != 12:
      return None

    id, length, datatype = struct.unpack("III", buff)

    # I'm trusting that your data is good, you might want to do something like validate a reasonable
    # length
    data = fd.read(length)
    if len(data) != length:
      return None

    return cls(id, cls.ValidTypes[datatype], data)

  def __init__(self, id, datatype, data):
    self.id = id
    self.datatype = datatype
    self.data = data

  def decode(self, encoding):
    # You might want to make it possible to decode more than once, or make it an error.
    # This implementation will create problems if you try to decode more than once.
    decoder = codecs.getdecoder(encoding)
    self.data = decoder(self.data)[0]  # the decoder callable returns (decoded text, bytes consumed)

def main():
  # TODO: add error handling for io problems
  fd = open('datafile.dat', 'rb')
  records = []
  rec = Record.read(fd)
  while rec is not None:
    records.append(rec)
    rec = Record.read(fd)

  fd.close()

  # sort your records and find the one with the encoding in it
  encoding = ...

  # decode the records that need to be decoded
  todecode = [r for r in records if r.datatype == 'UNKNOWN_ENC_FIELD']
  for r in todecode:
    r.decode(encoding)

  # present data to user
This is an outline of what you need. I put the dictionary/enum in there as an example if you feel that something like that would be convenient. You can pass around integer values or whatever is actually in the field if you want.

My read methodology and while loop could probably be more pythonic; I've been writing C for a while. You should probably use the with syntax for the open. Handle IO errors in the scope of the file open, not inside the Record class. In the Record class, it'd be a good idea to throw an error if you see an incomplete record or an enum value of an unknown type. You might want to store the records in something other than a list for access purposes. You might also want to implement some comparison operators and @total_ordering in the Record class so that you can easily sort it with sorted(). http://docs.python.org/py3k/library/functools.html?highlight=functools#functools

If you have records of a type other than string, you could rework decode to properly decode them and call decode on everything.
