leterip
Aug 25, 2004

tef posted:

javascript/json only has double precision floating point

That's just JavaScript. JSON as a spec allows arbitrarily sized integers/decimals.


tef
May 30, 2004

-> some l-system crap ->

leterip posted:

That's just JavaScript. JSON as a spec allows arbitrarily sized integers/decimals.

JavaScript and most parsers that I've encountered. JSON is designed to be a subset of JavaScript.

That is why we have things like this

quote:

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".

Cat Plus Plus
Apr 8, 2011

:frogc00l:

The King of Swag posted:

I've seen that before and the problem is that I need to save my data out as binary and be able to easily read it all back.

I second Protobuf for this. It makes forward/backward compatibility a lot easier, and sooner or later you will change the format and will have to load older files. Plus, it's language-agnostic and portable.

JSON is fine if you don't want a binary file format.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Python's JSON parser distinguishes between ints and floats based on whether the number literal contains a decimal point or exponent.

http://hg.python.org/cpython/file/5395f96588d4/Lib/json/scanner.py#l48
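(A quick illustration of that behaviour with the stdlib json module; the exponent case goes through parse_float as well.)

Python code:
import json

print(type(json.loads('2')))     # int
print(type(json.loads('2.0')))   # float -- a decimal point routes it through parse_float
print(type(json.loads('2e3')))   # float -- so does an exponent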

tef
May 30, 2004

-> some l-system crap ->
It also handles NaN and +/- Infinity, which are excluded from the spec:

quote:

Numeric values that cannot be represented as sequences of digits
(such as Infinity and NaN) are not permitted.
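(Both halves of that are easy to see from the REPL; allow_nan=False is the stdlib's way of opting into strictness.)

Python code:
import json

print(json.loads('[NaN, Infinity, -Infinity]'))  # [nan, inf, -inf], accepted by default
print(json.dumps(float('inf')))                  # 'Infinity', which is not legal JSON per the RFC
# json.dumps(float('inf'), allow_nan=False) would raise ValueError if you want spec-compliant output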

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Sure, but that says nothing about the validity of the way Python's JSON scanner interprets number literals.

There's nothing explicit, in the JSON RFC or otherwise, saying that all numbers should be within the representable range of IEEE 754 double-precision floats.

tef
May 30, 2004

-> some l-system crap ->
I just don't like JSON :argh:

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Sure. You don't have to like it. That's your opinion, and I respect it.

But don't make up incorrect facts that are verifiably false. That's rude.

tef
May 30, 2004

-> some l-system crap ->
Sorry, I just assumed that the design criteria of JSON, being a subset of Javascript, implied that the numbers were doubles, and the strings didn't handle things outside the BMP. The latter is explicit, the former is just my faulty recollection :3:

Also, you smell.

ufarn
May 30, 2009
How would I go about writing a family tree in Python, and what charting library should I use for it?

Cat Plus Plus
Apr 8, 2011

:frogc00l:

ufarn posted:

How would I go about writing a family tree in Python, and what charting library should I use for it?

I'd pick a graph library (like igraph or NetworkX); they usually come with drawing support, or at least Graphviz export.
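(A rough sketch of the NetworkX route. The names and the tiny tree are made up, the drawing needs matplotlib installed, and the Graphviz export needs pydot; exact function locations vary a little between NetworkX versions.)

Python code:
import networkx as nx
import matplotlib.pyplot as plt

# hypothetical family tree: edges point from parent to child
tree = nx.DiGraph()
tree.add_edges_from([
    ('Grandma', 'Mum'), ('Grandpa', 'Mum'),
    ('Mum', 'Me'), ('Dad', 'Me'),
])

nx.draw(tree, with_labels=True)  # quick-and-dirty matplotlib drawing
plt.show()

# or hand it to Graphviz for a proper layered layout
nx.drawing.nx_pydot.write_dot(tree, 'family.dot')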

Sub Par
Jul 18, 2001


Dinosaur Gum
I have what may be a dumb question. I have a python script that is a small command line app. It's a proof of concept for what may end up being a more fully-featured command-line app. I need to demonstrate this to someone who is not technically savvy enough to install python and whatnot, and I can't just email them an exe.

I don't know much about web stuff, but is there some website where I can upload the .py and have it run in a browser, so I can send this person to e.g. http://myscript.idleonline.com or something?

Cat Plus Plus
Apr 8, 2011

:frogc00l:
Ideone can run both Python 2 and 3 (and more). It's really handy.

Sub Par
Jul 18, 2001


Dinosaur Gum
That is handy, thanks. But I need it to be interactive. It's along the lines of:
code:
while True:
    user_input = raw_input('Type Stuff -> ')
    if user_input in ('exit', 'quit'): break
    print do_awesome_stuff(user_input)

Cat Plus Plus
Apr 8, 2011

:frogc00l:
You can put the input sequence in the "stdin" field, and it'll work.
Alternatively, you can try repl.it, which actually runs the interpreter compiled to JavaScript inside the browser.

evilentity
Jun 25, 2010
Or you could send the exe :v:

http://www.pyinstaller.org/

Works for me.

OnceIWasAnOstrich
Jul 22, 2006

I am writing a webapp (using CherryPy) for visualization of some data sets. This isn't going to be on the internet, just run on the user's computer. Currently I am loading the data (1-6+ integer lists, 3M+ elements long) into memory. The client POSTs and CherryPy responds with a JSON dump of the slice of the datasets that was requested. This works great because I only have one dataset and an obscene amount of RAM, both things that users may lack.

I'm trying to think of the most appropriate way to access the data as needed without loading everything into memory. I thought about SQLite, but SQL doesn't seem to have a way to easily deal with array-like data. Any suggestions?

Sub Par
Jul 18, 2001


Dinosaur Gum

evilentity posted:

Or you could send the exe :v:

http://www.pyinstaller.org/

Works for me.

I considered that, but py2exe was building 73MB of poo poo that needed to come along with the exe so I gave up. I ended up doing a GoToMeeting, which worked fine. But I'll check out PyInstaller. Thanks.

Emacs Headroom
Aug 2, 2003

OnceIWasAnOstrich posted:

I'm trying to think of the most appropriate way to access the data as needed without loading everything into memory. I thought about SQLite, but SQL doesn't seem to have a way to easily deal with array-like data. Any suggestions?

You could use HDF5. If you're not building something you intend to generalize to other things, though, you could also just split up your arrays into a series of separate files (like pickled numpy arrays, each containing 100k elements or something) and load them as needed?
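(A minimal sketch of the split-into-files idea, with made-up names and chunk size; numpy.save/.load and .npy files are used here instead of pickle, but the principle is the same.)

Python code:
import numpy as np

CHUNK = 100000  # elements per file

def save_chunks(arr, prefix):
    # write the big array out as a series of small .npy files
    for i in range(0, len(arr), CHUNK):
        np.save('%s_%06d.npy' % (prefix, i // CHUNK), arr[i:i + CHUNK])

def load_slice(prefix, start, stop):
    # read back only the files that overlap the requested [start, stop) range
    parts = []
    for idx in range(start // CHUNK, (stop - 1) // CHUNK + 1):
        part = np.load('%s_%06d.npy' % (prefix, idx))
        lo = max(start - idx * CHUNK, 0)
        hi = min(stop - idx * CHUNK, len(part))
        parts.append(part[lo:hi])
    return np.concatenate(parts)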

Cat Plus Plus
Apr 8, 2011

:frogc00l:

OnceIWasAnOstrich posted:

I'm trying to think of the most appropriate way to access the data as needed without loading everything into memory. I thought about SQLite, but SQL doesn't seem to have a way to easily deal with array-like data. Any suggestions?

PyTables might help.
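(Something along these lines; the file and node names are made up, and note that the PyTables releases current at the time spelled these openFile/createArray rather than the open_file/create_array used below.)

Python code:
import numpy as np
import tables

data = np.random.randint(0, 100, size=3000000)

# write the dataset into an HDF5 file once...
h5 = tables.open_file('datasets.h5', mode='w')
h5.create_array(h5.root, 'series1', data)
h5.close()

# ...then serve each request by slicing straight from disk,
# without ever holding the whole array in memory
h5 = tables.open_file('datasets.h5', mode='r')
chunk = h5.root.series1[10000:20000]
h5.close()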

OnceIWasAnOstrich
Jul 22, 2006


PyTables for HDF5 is probably what I will use, although what I've gathered from reading up on all this is that I should really be using NumPy arrays instead of plain CPython lists. I don't know why, but I have some sort of weird bias against NumPy.

Chimp_On_Stilts
Aug 31, 2004
Holy Hell.
I'm trying to start using Tkinter and I'm off to a bad start. I can't import it.

Here's the error I get:

code:
Traceback (most recent call last):
  File "Foo.py", line 5, in <module>
    from Tkinter import *
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk/Tkinter.py", line 39, in <module>
    import _tkinter # If this fails your Python may not be configured for Tk
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_tkinter.so, 2): no suitable image found.  Did find:
	/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_tkinter.so: no matching architecture in universal wrapper
I'm running MacOS 10.7. I've spent some time searching, but I haven't puzzled out a solution.

EDIT: I was running Python 2.7; upgrading to 2.7.3 fixed it, if anybody wanted to know.

Chimp_On_Stilts fucked around with this message at 03:18 on Aug 19, 2012

OnceIWasAnOstrich
Jul 22, 2006

OnceIWasAnOstrich posted:

PyTables for HDF5 is probably what I will use, although what I've gathered from reading up on all this is that I should really be using NumPy arrays instead of plain CPython lists. I don't know why, but I have some sort of weird bias against NumPy.

While HDF5 and PyTables are actually pretty amazing for what I'm doing (and save a shitton of memory and improve speed for a number of internal parts), there is no way in hell that most of the people who will want to use my program will be able to get HDF5 + PyTables installed correctly, especially on OSX.

On another note, I have discovered another reason for my dislike of numpy.

Python code:
In [155]: l=[randint(0,100) for _ in xrange(200)]

In [156]: a=numpy.array(l)

In [157]: timeit sum(l)
100000 loops, best of 3: 2.3 us per loop

In [158]: timeit sum(a)
10000 loops, best of 3: 111 us per loop

In [159]: timeit numpy.sum(a)
100000 loops, best of 3: 4.35 us per loop
Apparently CPython's optimization of list summation is better than numpy's optimized summation of arrays. When you have a script where ~80% of the execution time is tied up in integer list summation...welp.

Emacs Headroom
Aug 2, 2003

OnceIWasAnOstrich posted:

Apparently CPython's optimization of list summation is better than numpy's optimized summation of arrays. When you have a script where ~80% of the execution time is tied up in integer list summation...welp.
Uh, that's pretty weird. Is your numpy install hosed somehow?

This is Python 2.7.3 installed through Macports:

Python code:
In [1]: l=[randint(0,100) for _ in xrange(200)]

In [2]: a=numpy.array(l)

In [3]: %timeit a.sum()
100000 loops, best of 3: 2.27 us per loop

In [4]: %timeit sum(a)
100000 loops, best of 3: 3.76 us per loop

In [5]: %timeit sum(l)
10000 loops, best of 3: 44.3 us per loop

In [6]: %timeit numpy.sum(l)
10000 loops, best of 3: 44.6 us per loop

Cat Plus Plus
Apr 8, 2011

:frogc00l:
Apparently a.sum() and numpy.sum(a) aren't exactly the same for some reason. :iiam:
But yeah, don't use the built-in sum for NumPy arrays; you'll benefit from them on large datasets anyway (because NumPy arrays are unboxed, so at the very least you'll be more cache-friendly).

For 200 elements I get (Py 2.7, Win7 x64, NP 1.6.1):

code:
In [7]: timeit sum(l)
100000 loops, best of 3: 2.53 us per loop

In [8]: timeit sum(a)
10000 loops, best of 3: 97.4 us per loop

In [9]: timeit a.sum()
1000000 loops, best of 3: 1.93 us per loop

In [10]: timeit numpy.sum(a)
100000 loops, best of 3: 2.82 us per loop
For 1000000 (and the difference between a.sum() and numpy.sum(a) is not visible; I don't know what that is about):

code:
In [23]: timeit sum(a)
1 loops, best of 3: 474 ms per loop

In [24]: timeit sum(l)
100 loops, best of 3: 11.9 ms per loop

In [25]: timeit a.sum()
1000 loops, best of 3: 1.38 ms per loop

In [26]: timeit numpy.sum(a)
1000 loops, best of 3: 1.38 ms per loop

OnceIWasAnOstrich
Jul 22, 2006

Emacs Headroom posted:

Uh, that's pretty weird. Is your numpy install hosed somehow?


This is 100% possible and something I should probably investigate, although I have been way too busy to bother. I have a lot of trouble compiling any Python packages with gcc on 10.7. I probably horribly broke something somewhere since I was new to both OSX and Python.

Trying to install anything that needs to use gcc with distribute gives me this sort of error:

code:
lipo: can't figure out the architecture type of: /var/tmp//cch2WPt0.out

error: command 'gcc-4.2' failed with exit status 1
Anything non-Python related works great and compiles just fine with whatever the Xcode 4 compiler is. The only thing I have found that fixes this is to replace the gcc-4.2 executable with the gcc-4.0 executable from Xcode 3, then switch it back after I'm done. I imagine this could possibly break numpy.

I should probably just nuke my OS and start over.

Sockser
Jun 28, 2007

This world only remembers the results!




I must be retarded or crazy. I've installed Python too many times on my machine and it's totally hosed me I think. Don't know if this is a Unix problem or a Python problem or both.

Set up Pygame on OSX Mountain Lion and everything. Friend started a code project; I grab his code. Guess I have all my modules installed to /usr/bin/python2.7. Cool.

So if I go into a terminal and type which python2.7 I get "/usr/bin/python2.7"

if I call python2.7 main.py I get an error about a missing module
if I call /usr/bin/python2.7 main.py it works just fine

do I have to gently caress with my bash profile or something?
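(One way to see what actually differs between the two invocations, purely as a hypothetical diagnostic:)

Python code:
# save as whichpython.py, then run it both ways:
#   python2.7 whichpython.py
#   /usr/bin/python2.7 whichpython.py
import sys

print(sys.executable)        # which binary actually ran
print('\n'.join(sys.path))   # and where it looks for modules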

vikingstrike
Sep 23, 2007

whats happening, captain
Can you post the error? Have you hosed with your PYTHONPATH recently?

Sockser
Jun 28, 2007

This world only remembers the results!




welp

added
export PYTHONPATH=~/Library/Frameworks:$PYTHONPATH
to my .bash_profile

now it works. Cool.

I really hate OSX changing directory structure every release.

Emacs Headroom
Aug 2, 2003
I couldn't imagine wrestling with the system Python in OSX. Macports might not do everything right, but it makes Python a lot less painful.

vikingstrike
Sep 23, 2007

whats happening, captain

Sockser posted:

welp

added
export PYTHONPATH=~/Library/Frameworks:$PYTHONPATH
to my .bash_profile

now it works. Cool.

I really hate OSX changing directory structure every release.

It's annoying when you don't know about it and they change it suddenly, but I think moving that folder into the user's home folder is a smart move.

Although, like Emacs Headroom said, using Macports or Homebrew is usually a lot less painful.

Modern Pragmatist
Aug 20, 2008
So I'm back with more python3 str/bytes stuff.

The basic problem that I'm trying to solve is this. I read some data in from a binary file (comes in as a bytestring); however, at the time of reading, I don't actually know the proper encoding. By default the files use 'iso8859', but if it's anything other than that, it is specified at the end of the file (annoying, I know). So basically I've tried to create a subclass of str that essentially stores the bytestring and then allows us to decode it later with the specified encoding if there is one.

The user only has to deal with unicode and never accesses this class directly. Instead, another part of my program creates a new String instance and can specify the encoding determined after reading so that it can be stored and the unicode can be properly encoded when I need to write it back to the file.

I hope that makes sense.

I've tried to use the unicode sandwich as much as possible so that the user only deals in unicode and the only time bytestrings are necessary is when reading/writing the data to a file.

If you could take a look at my code, I would really appreciate it. I may be over-complicating things a bit.

http://pastie.org/4562873

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Seek to the end of the file, read the encoding first?

Captain Capacitor
Jan 21, 2008

The code you say?
What Suspicious Dish said, or use something like mmap.

Scaevolus
Apr 16, 2007

When reading a file: Read the entire file into a bytes object. Read the encoding from the end, then return the file (minus encoding) decoded with that encoding.

When writing a file: Encode the data (unicode) using your specified encoding, yielding bytes. Append the encoding. Write it to the file.
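(A minimal sketch of that round trip. The trailer layout here, the encoding name after a final newline, is invented for illustration; the real files described in this thread actually bury the encoding inside one of the data fields.)

Python code:
def read_text(path):
    # read everything, peel the encoding name off the end, decode the rest
    with open(path, 'rb') as f:
        raw = f.read()
    body, encoding = raw.rsplit(b'\n', 1)
    return body.decode(encoding.decode('ascii'))

def write_text(path, text, encoding):
    # encode the data, then append the encoding name so it can be recovered later
    with open(path, 'wb') as f:
        f.write(text.encode(encoding))
        f.write(b'\n' + encoding.encode('ascii'))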

Dren
Jan 5, 2001

Pillbug

Modern Pragmatist posted:

So I'm back with more python3 str/bytes stuff.

The basic problem that I'm trying to solve is this. I read some data in from a binary file (comes in as a bytestring); however, at the time of reading, I don't actually know the proper encoding. By default the files use 'iso8859', but if it's anything other than that, it is specified at the end of the file (annoying, I know). So basically I've tried to create a subclass of str that essentially stores the bytestring and then allows us to decode it later with the specified encoding if there is one.

The user only has to deal with unicode and never accesses this class directly. Instead, another part of my program creates a new String instance and can specify the encoding determined after reading so that it can be stored and the unicode can be properly encoded when I need to write it back to the file.

I hope that makes sense.

I've tried to use the unicode sandwich as much as possible so that the user only deals in unicode and the only time bytestrings are necessary is when reading/writing the data to a file.

If you could take a look at my code, I would really appreciate it. I may be over-complicating things a bit.

http://pastie.org/4562873

Is the data ever stored as Unicode using the BOM, which must be at the beginning of the file?

What about large files? If there is a 2 GB file, you're going to suck up a ton of memory to read it in as a bytestring and then create a decoded copy of it.

I haven't used Python 3 but I read some documentation on it:
the unicode howto - http://docs.python.org/release/3.0.1/howto/unicode.html
the open function - http://docs.python.org/release/3.0.1/library/functions.html#open

Python 3's open function allows you to specify an encoding for reading. If you were to specify the proper encoding when you opened the file, you would never have to worry about bytestrings, encoding, or decoding. You would also not be required to load the entire contents of the file into memory in order to properly decode them (which means you could work with large files).

I recommend that you seek to the end of the file where the encoding specification is, read the encoding, then re-open the file in the proper encoding. You can/should get rid of your intermediate string class. Python 3's open method will do all of the work for you.
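(That is, once the encoding is known, something this simple streams decoded text without slurping the whole file; the filename is made up.)

Python code:
# Python 3: open() hands you str (unicode) directly once you pass an encoding
with open('datafile.txt', encoding='iso8859-1') as f:
    for line in f:
        print(line)  # already decoded, no manual bytes handling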

Cat Plus Plus
Apr 8, 2011

:frogc00l:

Dren posted:

Python 3's open function allows you to specify an encoding for reading. If you were to specify the proper encoding when you opened the file, you would never have to worry about bytestrings, encoding, or decoding. You would also not be required to load the entire contents of the file into memory in order to properly decode them (which means you could work with large files).

It's there in Python 2, too: codecs.open or io.open (but that's only usable in 2.7; 2.6 had a lovely implementation of io).
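(Same idea on 2.7, again with a made-up filename:)

Python code:
import io

# io.open in Python 2.7 behaves like Python 3's open: it returns unicode
with io.open('datafile.txt', encoding='iso8859-1') as f:
    text = f.read()  # a unicode object, not a bytestring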

Modern Pragmatist
Aug 20, 2008

Dren posted:

Is the data ever stored as Unicode using the BOM, which must be at the beginning of the file?

What about large files? If there is a 2 GB file, you're going to suck up a ton of memory to read it in as a bytestring and then create a decoded copy of it.

I haven't used Python 3 but I read some documentation on it:
the unicode howto - http://docs.python.org/release/3.0.1/howto/unicode.html
the open function - http://docs.python.org/release/3.0.1/library/functions.html#open

Python 3's open function allows you to specify an encoding for reading. If you were to specify the proper encoding when you opened the file, you would never have to worry about bytestrings, encoding, or decoding. You would also not be required to load the entire contents of the file into memory in order to properly decode them (which means you could work with large files).

I recommend that you seek to the end of the file where the encoding specification is, read the encoding, then re-open the file in the proper encoding. You can/should get rid of your intermediate string class. Python 3's open method will do all of the work for you.

So the files I'm dealing with are a little complicated. Basically, you have the following binary format:

<ID><Length of Data><Datatype><Data><ID><Length of Data><Datatype><Data>...

And this repeats for ~100 data fields. The encoding is actually contained in one of these fields. Also, only 4 of the 15 or so possible Datatypes are encoded using the specified encoding. The others are encoded using iso8859. Because of this mix of encodings, I don't think setting the encoding for open would work out.

The field that contains the encoding to be used is approximately the 25th sequential field in the file but isn't consistent (not easy to seek for). That's why my initial thought was to store all the bytestrings and then decode them all after I've read in all fields (including the encoding). It wouldn't be so bad if the encoding field always occurred before any fields that required it, but that is rarely the case.

I can try to do something similar to what Suspicious Dish suggested but I'll have to try to figure out how to work that into the existing code. As it is, most fields are oblivious of the other fields, so it's difficult for each field to check for a specific encoding during file reading.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Read as bytestrings, and decode after you read something. Don't create a subclass of str, that's a really terrible idea.


Dren
Jan 5, 2001

Pillbug
OK, so your files have a record format:

<ID><Length of Data><Datatype><Data>

Each file consists of a collection of these records. How many datatypes are there? It sounds like some are metadata, for instance the record that tells the encoding of other types. I imagine that some of the datatypes have data that is of a known encoding or is an integer or something that doesn't need to be decoded. Furthermore, I expect that the ID, length, and datatype fields are fixed width fields.

I suggest that you create a class to model the record. Something like this, apologies if I'm using Python 2.x style:
code:
import struct
import codecs

class Record:
  ValidTypes = {
    1 : 'ENCODING',
    2 : 'ISO8859_FIELD',
    3 : 'UNKNOWN_ENC_FIELD',
  }

  @classmethod
  def read(cls, fd):
    buff = fd.read(12)  # read 12 bytes, I'm assuming <ID><Length><Datatype> takes 12 bytes
    if len(buff) != 12:
      return None

    id, length, datatype = struct.unpack("III", buff)

    # I'm trusting that your data is good, you might want to do something like validate a reasonable
    # length
    data = fd.read(length)
    if len(data) != length:
      return None

    return cls(id, cls.ValidTypes[datatype], data)

  def __init__(self, id, datatype, data):
    self.id = id
    self.datatype = datatype
    self.data = data

  def decode(self, encoding):
    # You might want to make it possible to decode more than once, or make it an error.
    # This implementation will create problems if you try to decode more than once.
    decoder = codecs.getdecoder(encoding)
    self.data = decoder(self.data)[0]  # the decoder callable returns (decoded text, bytes consumed)

def main():
  # TODO: add error handling for io problems
  fd = open('datafile.dat', 'rb')
  records = []
  rec = Record.read(fd)
  while rec is not None:
    records.append(rec)
    rec = Record.read(fd)

  fd.close()

  # sort your records and find the one with the encoding in it
  encoding = ...

  # decode the records that need to be decoded
  todecode = [r for r in records if r.datatype == 'UNKNOWN_ENC_FIELD']
  for r in todecode:
    r.decode(encoding)

  # present data to user
This is an outline of what you need. I put the dictionary/enum in there as an example if you feel that something like that would be convenient. You can pass around integer values or whatever is actually in the field if you want.

My read methodology and while loop could probably be more pythonic; I've been writing C for a while. You should probably use the with syntax for the open. Handle IO errors in the scope of the file open, not inside the Record class. In the Record class, it'd be a good idea to throw an error if you see an incomplete record or an enum value of an unknown type. You might want to store the records in something other than a list for access purposes. You might also want to implement some comparison operators and @total_ordering in the Record class so that you can easily sort it with sorted(). http://docs.python.org/py3k/library/functools.html?highlight=functools#functools

If you have records of a type other than string, you could rework decode to properly decode them and call decode on everything.
