  • Locked thread
N.Z.'s Champion
Jun 8, 2003

Yam Slacker
After hating on the XML processing in Python (4Suite is sluggish and buggy) I found lxml and P4X. Use them unless you want to break your brain.

N.Z.'s Champion fucked around with this message at 00:30 on Nov 5, 2007

N.Z.'s Champion
Jun 8, 2003

Yam Slacker

m0nk3yz posted:

Which version of Python were you using?

2.4 and 2.5, but that doesn't really change the crappiness of 4suite xml processing (which is all I'm trying to warn people off).

N.Z.'s Champion
Jun 8, 2003

Yam Slacker

devilmouse posted:

Ugh, can't believe I'm asking this... but does anyone know any modules to go about extracting data/images from Powerpoint (ppt/pptx) files? The closest thing I've found uses ActivePython/Win32 COM stuff, but that's not so helpful when me and the server are running OS X and Linux.
PPTX files are ZIP archives, so you can iterate through them looking for images.
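A rough sketch of that iteration, using only the standard library `zipfile` module. The function name `iter_pptx_images` and the extension list are made up for illustration, and it assumes images sit under `ppt/media/`, which is where PowerPoint normally puts them but isn't guaranteed for every file.

```python
import zipfile

# Hypothetical list of extensions to treat as images; extend as needed.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".emf", ".wmf")


def iter_pptx_images(pptx_file):
    """Yield (archive_name, raw_bytes) for each image-like member of a PPTX.

    pptx_file may be a path or a binary file-like object.
    Assumes images are stored under ppt/media/ (the usual location).
    """
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            if name.startswith("ppt/media/") and name.lower().endswith(IMAGE_EXTENSIONS):
                yield name, archive.read(name)
```

Something like `for name, data in iter_pptx_images("deck.pptx"): ...` would then let you write each image wherever you want (`deck.pptx` is a placeholder filename).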

PPT files vary so much that only an office suite will be able to deal with all the vagaries of the format. You can run OpenOffice in server mode (headless, no X server) and stream documents to it with PyODConverter. If you're on Debian you can install OpenOffice in server mode by apt-get installing docvert-openoffice.org (I make Docvert).

If by "data" you mean the slides and text, then I suggest converting PPTX to ODP (OpenDocument Presentation) with OpenOffice and parsing that format, because it's considerably more sane than PPTX. ODP files are also ZIPs of XML and binaries, and you could use lxml (either conventional node iteration or perhaps XSLT) to extract the useful parts.
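Here's a minimal sketch of the node-iteration approach on an ODP file. To keep it dependency-free it uses the standard library's `xml.etree.ElementTree` rather than lxml (lxml's `etree` offers a largely compatible interface). The function name `odp_slide_text` is hypothetical; the namespaces are the standard OpenDocument ones, and the slide content is read from `content.xml`.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard OpenDocument namespaces for drawing pages (slides) and text.
DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odp_slide_text(odp_file):
    """Return a list of slides, each a list of non-empty paragraph strings.

    odp_file may be a path or a binary file-like object.
    Note: any speaker notes nested inside a slide are included too.
    """
    with zipfile.ZipFile(odp_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    slides = []
    # Each <draw:page> element is one slide.
    for page in root.iter("{%s}page" % DRAW_NS):
        paragraphs = []
        # Each <text:p> is a paragraph; itertext() flattens nested spans.
        for para in page.iter("{%s}p" % TEXT_NS):
            text = "".join(para.itertext())
            if text.strip():
                paragraphs.append(text)
        slides.append(paragraphs)
    return slides
```

Images live elsewhere in the same ZIP, so the same archive object can be used to pull those out as well.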

N.Z.'s Champion fucked around with this message at 10:39 on Oct 26, 2010

N.Z.'s Champion
Jun 8, 2003

Yam Slacker
Python projects go in this thread, right? For the last few months I've been busy porting Docvert from PHP to Python. The software converts Office files to DocBook and clean HTML: lists are made hierarchical, any vector diagrams are converted to SVG/PNG, and so on. It needs PyUNO/LibreOffice, but the conversion from OpenDocument to DocBook and HTML is done in Docvert.

N.Z.'s Champion fucked around with this message at 23:48 on Mar 30, 2011
