  • Locked thread
Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Asymmetrikon posted:

I don't know how much of Python's standard types are defined by spec or left up to implementation; wouldn't it be technically possible for list to be backed as a linked list?

Sure. It's also possible for every integer to be a linked list where the value it represents is the length of said list. Both are equally silly. Python programs basically expect O(1) access and a linked list doesn't really help you there a lot.


Dominoes
Sep 20, 2007

Hey dudes; trying to run a java file from Python, using subprocess. It's not doing anything. The most basic example:

Python code:
subprocess.run(['java'], shell=True, check=True)
This should show the java help page, but is returning CalledProcessError: Command '['java']' returned non-zero exit status 1. The docs mention this, but I still don't get it. Without check=True, it doesn't do anything. The actual code I'm trying to run looks like this:

Python code:
subprocess.run(['java', '-jar', 'mathtoweb.jar'], shell=True)

Eggnogium
Jun 1, 2010

Never give an inch! Hnnnghhhhhh!

Dominoes posted:

Hey dudes; trying to run a java file from Python, using subprocess. It's not doing anything. The most basic example:

Python code:
subprocess.run(['java'], shell=True, check=True)
This should show the java help page, but is returning CalledProcessError: Command '['java']' returned non-zero exit status 1. The docs mention this, but I still don't get it. Without check=True, it doesn't do anything. The actual code I'm trying to run looks like this:

Python code:
subprocess.run(['java', '-jar', 'mathtoweb.jar'], shell=True)

Well java.exe with no params does return 1, so that's why it's throwing an exception when check=True. To actually get the output from java you have to pipe it as part of the call to subprocess.run. There are many ways to do this, depending on what you want to do with that output, but if you search "subprocess pipe output" you should find some useful info.
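A minimal sketch of the "pipe it as part of the call" idea, assuming Python 3.5+ (where subprocess.run exists). It runs the current Python interpreter as a stand-in child so it works without a JVM; the helper name run_and_capture is made up for illustration, and the real call would pass ['java', '-jar', 'mathtoweb.jar'] instead.

```python
import subprocess
import sys

def run_and_capture(cmd):
    """Run cmd (a list of arguments), wait for it, and return (exit code, stdout, stderr)."""
    # No shell=True: with a list, the program is executed directly.
    # stdout/stderr=PIPE collect the output instead of sending it to the console.
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode the captured bytes to str
    )
    return result.returncode, result.stdout, result.stderr

# Stand-in child process (assumption: swap in ['java', '-jar', 'mathtoweb.jar']).
code, out, err = run_and_capture([sys.executable, "-c", "print('hello from child')"])
print(code, out.strip())
```

subprocess.run blocks until the child exits and returns a CompletedProcess, so no separate wait() is needed. Leaving check=True off means a non-zero exit status (like the 1 from a bare java) comes back in returncode instead of raising CalledProcessError.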

QuarkJets
Sep 8, 2008

Dominoes posted:

Hey dudes; trying to run a java file from Python, using subprocess. It's not doing anything. The most basic example:

Python code:
subprocess.run(['java'], shell=True, check=True)
This should show the java help page, but is returning CalledProcessError: Command '['java']' returned non-zero exit status 1. The docs mention this, but I still don't get it. Without check=True, it doesn't do anything. The actual code I'm trying to run looks like this:

Python code:
subprocess.run(['java', '-jar', 'mathtoweb.jar'], shell=True)

You don't need shell=True if you're passing a list like that

Python code:
proc = subprocess.Popen(['java', '-jar', 'mathtoweb.jar'])
proc.wait()
I've never used subprocess.run so I don't know if it's necessary to call wait() or communicate() or something similar. If it's returning a Popen object, then depending on how your code is set up you may need to call the wait() method (or similar) to ensure that execution finishes when you want it to finish. If the subprocess is still running when your script closes, then the subprocess will be killed (in my experience; behavior may have changed in later versions), so simply calling wait() will ensure that execution proceeds normally.

What happens if you do the above without shell=True or check=True? Do you get any error messages?

Sidenote, I love subprocess; it's like a poor man's parallel processing module and lets you do all sorts of crazy stuff. But it's also sometimes a messy crutch. In your case, you want to invoke a jar; have you considered installing a module like Pyjnius or py4J in order to invoke Java classes directly?

Dex
May 26, 2006

Quintuple x!!!

Would not escrow again.

VERY MISLEADING!
Popen works as a context manager, fyi, you can use "with Popen(blah) as proc:" to get your waiting and file descriptor closing and so on for free
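A short sketch of the context-manager form, assuming Python 3.2+ (where Popen gained with-statement support). The child here is the current Python interpreter standing in for the java command, purely so the example runs anywhere.

```python
import subprocess
import sys

# Stand-in command (assumption): the current interpreter instead of java.
cmd = [sys.executable, "-c", "print('done')"]

with subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True) as proc:
    output, _ = proc.communicate()  # read stdout until the child exits

# On leaving the with-block the pipes are closed and the process has been waited for.
print(proc.returncode, output.strip())
```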

Dominoes
Sep 20, 2007

Thanks for the subprocess words. Got it sorted using that advice!

accipter
Sep 12, 2003
I would like to use Bokeh to plot data, and then select a portion of that data to work with. I don't understand how to get the data or the region back from the selector tools. Do I need to use a callback? Thanks.

StormyDragon
Jun 3, 2011

accipter posted:

I would like to use Bokeh to plot data, and then select a portion of that data to work with. I don't understand how to get the data or the region back from the selector tools. Do I need to use a callback? Thanks.

The relevant documentation appears to be listed here: CustomJS for Selections

accipter
Sep 12, 2003

StormyDragon posted:

The relevant documentation appears to be listed here: CustomJS for Selections

Thanks. I found this, but then I thought there should be a more direct way to get access to the data, but I couldn't find it.

StormyDragon
Jun 3, 2011

accipter posted:

Thanks. I found this, but then I thought there should be a more direct way to get access to the data, but I couldn't find it.

I think you will have to put a few more words into your exact use case, by the sound of it I can only surmise that you need to use the Bokeh Server in order to get python callbacks.

accipter
Sep 12, 2003

StormyDragon posted:

I think you will have to put a few more words into your exact use case, by the sound of it I can only surmise that you need to use the Bokeh Server in order to get python callbacks.

Fair enough. I am looking at porting an application that uses matplotlib to interactively select data to a Jupyter Notebook. It was my understanding that matplotlib couldn't be interacted with in a notebook. I was thinking that bokeh could be a potential solution. A cell would be used to create the plot and interactively select the data, and then the following cell would retrieve the selected region from the chart object.

creatine
Jan 27, 2012




Question: I am using Tkinter and Tweepy to build a simple python twitter gui. I have an entry box and button that I can send tweets from (works fine).

I then have a text box and two buttons for it. One button should start a twitter stream and print new tweets as they appear into the text box. The other button should stop the stream.

I have my Tkinter GUI set up as one class and the twitter portion set up as another. Upon starting the stream I can get my script to print the tweets to the console but I can't figure out if there is a way to write them to the tkinter text box.

I put the entire script on pastebin here: http://pastebin.com/S6sK92Hh

the settings import is just a file that looks like this:
code:
settings = {
	'consumer_key' : '',
	'consumer_secret' : '',
	'access_key' : '',
	'access_secret' : ''
}
 
I'm new to python, I've been messing with it on/off for maybe a couple months. I don't even know if what I'm doing is possible or not. But I appreciate any advice.

edit: I think I may have found a way to do this using tk.Frame that I was not utilizing earlier. Will update if I get it to work.

2edit: Rewrote a bunch http://pastebin.com/HEdzvJM9 still can't get it to write new tweets to textbox

creatine fucked around with this message at 04:14 on Jan 19, 2017

Whybird
Aug 2, 2009

Phaiston have long avoided the tightly competetive defence sector, but the IRDA Act 2052 has given us the freedom we need to bring out something really special.

https://team-robostar.itch.io/robostar


Nap Ghost
I'm picking up Python and I have a question about data structures, and which one's best.

In my code I'm tracking which players have sight of which entities. If a player moves, I'll need to recalculate which entities they have visibility to; if an entity moves, I'll need to recalculate which players have visibility of it.

I'll also need to iterate over which players can see an entity (for when the entity does something that players should find out about) and over which entities a player can see (for the player's UI).

There will generally be quite a lot more entities than players, and a lot of them will not be seen by any player at any time.

My current approach is for my "visibility" class to have two dictionaries: one with player ID as the key, and each entry being a set of all visible entities; the other with entity ID as the key and each entry a set of players who have visibility of the entity. Whenever the data changes, the class will have to update both dictionaries, but having the data held in this way will presumably mean that I don't have to search through the entirety of the visibility data each time I need to iterate through one player's visible entities or one entity's viewing players.

Is the approach I'm describing reasonable, or is there something I've misunderstood about how fast different data structures are?
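For reference, the two-dictionary layout described above could be sketched roughly like this. The class and method names are hypothetical, not taken from the actual project; the point is only that every change updates both views so either side can be iterated without scanning everything.

```python
from collections import defaultdict

class Visibility:
    """Two-way index: player -> visible entities, and entity -> viewing players."""

    def __init__(self):
        self._seen_by_player = defaultdict(set)
        self._viewers_of_entity = defaultdict(set)

    def set_visible(self, player_id, entity_id, visible):
        """Record whether player_id can currently see entity_id, updating both views."""
        if visible:
            self._seen_by_player[player_id].add(entity_id)
            self._viewers_of_entity[entity_id].add(player_id)
        else:
            # .get() avoids creating empty sets for pairs that were never recorded.
            self._seen_by_player.get(player_id, set()).discard(entity_id)
            self._viewers_of_entity.get(entity_id, set()).discard(player_id)

    def entities_seen_by(self, player_id):
        return frozenset(self._seen_by_player.get(player_id, ()))

    def viewers_of(self, entity_id):
        return frozenset(self._viewers_of_entity.get(entity_id, ()))

vis = Visibility()
vis.set_visible("alice", "goblin", True)
vis.set_visible("bob", "goblin", True)
vis.set_visible("alice", "goblin", False)
print(vis.viewers_of("goblin"), vis.entities_seen_by("alice"))
```

Entities that nobody sees never get an entry, which suits the case where most entities are invisible most of the time.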

Cingulate
Oct 23, 2012

by Fluffdaddy

Whybird posted:

I'm picking up Python and I have a question about data structures, and which one's best.

In my code I'm tracking which players have sight of which entities. If a player moves, I'll need to recalculate which entities they have visibility to; if an entity moves, I'll need to recalculate which players have visibility of it.

I'll also need to iterate over which players can see an entity (for when the entity does something that players should find out about) and over which entities a player can see (for the player's UI).

There will generally be quite a lot more entities than players, and a lot of them will not be seen by any player at any time.

My current approach is for my "visibility" class to have two dictionaries: one with player ID as the key, and each entry being a set of all visible entities; the other with entity ID as the key and each entry a set of players who have visibility of the entity. Whenever the data changes, the class will have to update both dictionaries, but having the data held in this way will presumably mean that I don't have to search through the entirety of the visibility data each time I need to iterate through one player's visible entities or one entity's viewing players.

Is the approach I'm describing reasonable, or is there something I've misunderstood about how fast different data structures are?
As your data is rectangular, sounds like a (sparse) array or a Pandas Dataframe might be best?

Space Kablooey
May 6, 2009


Dictionaries look their keys up by hash, so getting at the different sets shouldn't be too bad, and iterating through sets should also be fast enough.

However, that architecture seems a bit strange to me. Why can't you keep a list of visible entities on each instance instead of having this Visibility class? Presumably, you are already holding a collection of players instances and entities instances around, and having this visibility data separate from the main instances can only hurt you in the long run.

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

How many players and entities are we talking here? How often do you update the state, is it a fast action game or something turn-based? If the demands aren't high you can probably just do whatever feels simple and makes the most sense, without worrying about efficiency

What you're saying makes sense, there's actually a bidirectional dict library that uses a pair of plain dicts under the hood, so maintaining two views (one from each side) is definitely an approach people use

Do you actually need to maintain that full state though? It depends on what you need to do with the information, but from what you're saying everything relies on a player being able to see it happening - players need to know which entities they can see, and entities need to interact with each other if a player can see them. So you could just update the players' "what I can see" dict, and then combine their sets to get the set of all visible entities, and tell each of those to do their thing. Also saves you iterating over the larger group of entities to see if each is visible

You could also drop the state and just let your visibility check function tell the entity and players to do something if any of them are visible - but it really depends on what your game is going to need and if it's worth storing the visibility information for further use

baka kaba fucked around with this message at 14:12 on Jan 19, 2017

onionradish
Jul 6, 2006

That's spicy.

creatine posted:

Question: I am using Tkinter and Tweepy to build a simple python twitter gui. I have an entry box and button that I can send tweets from (works fine).

I then have a text box and two buttons for it. One button should start a twitter stream and print new tweets as they appear into the text box. The other button should stop the stream.

I have my Tkinter GUI set up as one class and the twitter portion set up as another. Upon starting the stream I can get my script to print the tweets to the console but I can't figure out if there is a way to write them to the tkinter text box.

I put the entire script on pastebin here :http://pastebin.com/S6sK92Hh

the settings import is just a file that looks like this:
code:
settings = {
	'consumer_key' : '',
	'consumer_secret' : '',
	'access_key' : '',
	'access_secret' : ''
}
 
I'm new to python, I've been messing with it on/off for maybe a couple months. I don't even know if what I'm doing is possible or not. But I appreciate any advice.

edit: I think I may have found a way to do this using tk.Frame that I was not utilizing earlier. Will update if I get it to work.

2edit: Rewrote a bunch http://pastebin.com/HEdzvJM9 still can't get it to write new tweets to textbox
It works for me on 3.5.1/Windows 7. Change your except: pass line to except: raise to make sure you're not suppressing some error.

Also, you should get rid of the Update class and move the methods to the GUI class. You never create an Update instance. That's why you're having to pass "self" as an argument to call those methods. It also makes "self" values misleading -- they look like they're referring to an instance of Update when they're really an instance of GUI. PM if you have questions.

Eela6
May 25, 2007
Shredded Hen
Dicts seem fine, so long as you don't have many players and you only need to track which players see enemies, not all entities which see each other.

I would just use a set for each player. If you want to check if a monster is visible to a particular player, just check membership in the set. It's very fast! Checking membership for each player is not a hindrance so long as you have a small amount of players. Most games have 1-16, so I think you'll be OK.

Also, as mentioned above, you might want to keep track of the union of all visible enemies and only check actions for enemies which are visible to some player.
I.e.,
code:


class Player():
    ...
    def update_visible_enemies(self, enemies):
        self.visible_enemies = {enemy for enemy in enemies if self.visible(enemy)}

def notify_players_of_monster(players):
    # union of every player's visible set
    all_visible_monsters = {monster for player in players for monster in player.visible_enemies}
    for monster in all_visible_monsters:
        for player in players:
            if monster in player.visible_enemies:
                # notify_player is assumed to take the player and the monster
                notify_player(player, monster)

Eela6 fucked around with this message at 15:24 on Jan 19, 2017

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Whybird posted:

but having the data held in this way will presumably mean that I don't have to search through the entirety of the visibility data each time I need to iterate through one player's visible entities or one entity's viewing players.

I'm not speaking directly to you as it sounds like you're in a state where you're not even sure what makes the most conceptual sense, but...as a general rule you should use the data structure that makes the most sense conceptually and then if you hit a performance issue consider changing it.

Eela6
May 25, 2007
Shredded Hen
"Use the simplest thing that could possibly work" is one of the programming aphorisms that has served me best.

With the following addendums:
'don't use a list to test membership'
'don't use a list when you need a deque'
'don't use sort() to maintain a sort when you could use bisect.insort()'
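
The three addendums, sketched with made-up data and nothing beyond the standard library:

```python
import bisect
from collections import deque

# Membership: a set lookup is a hash probe, while a list lookup scans the elements.
visible_ids = {3, 7, 42}
print(42 in visible_ids)

# Queue: deque pops from the left in O(1); list.pop(0) shifts every remaining item.
pending = deque(["a", "b", "c"])
first = pending.popleft()

# Sorted insert: insort finds the slot by binary search instead of re-sorting.
scores = [10, 20, 40]
bisect.insort(scores, 30)
print(first, list(pending), scores)
```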

creatine
Jan 27, 2012




onionradish posted:

It works for me on 3.5.1/Windows 7. Change your except: pass line to except: raise to make sure you're not suppressing some error.

Also, you should get rid of the Update class and move the methods to the GUI class. You never create an Update instance. That's why you're having to pass "self" as an argument to call those methods. It also makes "self" values misleading -- they look like they're referring to an instance of Update when they're really an instance of GUI. PM if you have questions.

Hey thanks! I moved some stuff around and got it to work as intended. Here's my working code as of now. http://pastebin.com/E8aD9q7v

Basically make a file called settings.py like I said above or just plug in your own twitter app credentials and it should work. python 3.6 /win10

Whybird
Aug 2, 2009

Phaiston have long avoided the tightly competetive defence sector, but the IRDA Act 2052 has given us the freedom we need to bring out something really special.

https://team-robostar.itch.io/robostar


Nap Ghost
Wow! Thank you all for the advice, this is really helpful!

baka kaba posted:

Do you actually need to maintain that full state though? It depends on what you need to do with the information, but from what you're saying everything relies on a player being able to see it happening - players need to know which entities they can see, and entities need to interact with each other if a player can see them. So you could just update the players' "what I can see" dict, and then combine their sets to get the set of all visible entities, and tell each of those to do their thing.

I've just been watching https://www.youtube.com/watch?v=C4Kc8xzcA68 on hash tables and I realise I've been assuming that membership testing in a set is much, much slower than it is (also, I'm absolutely astounded by just how clever hash tables are!) -- this was why I was wary to have each entity check in turn whether or not it was in a player's visible_entities list, because it felt like that would be a lot of time-consuming checks to make. The idea of using the intersection of all visible entities instead is a really clever one, and I like it a lot!

Tigren
Oct 3, 2003

Whybird posted:

Wow! Thank you all for the advice, this is really helpful!


I've just been watching https://www.youtube.com/watch?v=C4Kc8xzcA68 on hash tables and I realise I've been assuming that membership testing in a set is much, much slower than it is (also, I'm absolutely astounded by just how clever hash tables are!) -- this was why I was wary to have each entity check in turn whether or not it was in a player's visible_entities list, because it felt like that would be a lot of time-consuming checks to make. The idea of using the intersection of all visible entities instead is a really clever one, and I like it a lot!

I haven't watched that video, but 7 years is a long time in Python time. Check out this video by Raymond Hettinger (a super awesome core dev who does great, accessible presentations) about dicts in Python 3.6.

https://www.youtube.com/watch?v=p33CVV29OG8

A few other Hettinger videos

https://www.youtube.com/watch?v=wf-BqAjZb8M
https://www.youtube.com/watch?v=OSGv2VnC0go
https://www.youtube.com/watch?v=Bv25Dwe84g0

Eela6
May 25, 2007
Shredded Hen
Hash tables are super cool. Remember that you can create your own __hash__() for classes, if you'd like. It's not usually necessary, because user-defined classes will default to the object id for __hash__(), but it can be useful if you're creating data types that should always hash the same.

As a toy example (inspired by Fluent Python)
code:

class Point3():
    def __init__(self,  x,  y,  z):
        self.x, self.y, self.z = x,  y,  z
    def __hash__(self):
        return hash(self.x) ^ hash(self.y) ^ hash(self.z)

Raymond Hettinger is amazing.

Mad Jaqk
Jun 2, 2013

Whybird posted:

Wow! Thank you all for the advice, this is really helpful!


I've just been watching https://www.youtube.com/watch?v=C4Kc8xzcA68 on hash tables and I realise I've been assuming that membership testing in a set is much, much slower than it is (also, I'm absolutely astounded by just how clever hash tables are!) -- this was why I was wary to have each entity check in turn whether or not it was in a player's visible_entities list, because it felt like that would be a lot of time-consuming checks to make. The idea of using the intersection of all visible entities instead is a really clever one, and I like it a lot!

As I understand it, Python sets are also implemented as hash tables under the hood (only caring about the keys, not values), so checking membership in a set should be equivalent to checking if a key exists in a dict.

pmchem
Jan 22, 2010


Whybird posted:

I'm picking up Python and I have a question about data structures, and which one's best.

(words)

Is the approach I'm describing reasonable, or is there something I've misunderstood about how fast different data structures are?

There are many ways of solving that general problem, but in molecular dynamics it's handled this way:
https://en.wikipedia.org/wiki/Cell_lists
https://en.wikipedia.org/wiki/Verlet_list

Master_Odin
Apr 15, 2010

My spear never misses its mark...

ladies
So a system I work on is set up so that there are 20 cron jobs (run at 3 minute intervals) where each spawns a new instance of a shell script with a different argument (which says what user to use to run a different process).

Essentially:
code:
00  * * * *   /usr/local/submitty/bin/grade_students.sh  untrusted00  >  /dev/null
03  * * * *   /usr/local/submitty/bin/grade_students.sh  untrusted03  >  /dev/null
06  * * * *   /usr/local/submitty/bin/grade_students.sh  untrusted06  >  /dev/null
09  * * * *   /usr/local/submitty/bin/grade_students.sh  untrusted09  >  /dev/null
up to 57.

Each of these scripts then checks a directory for the next file in a queue, then operates on it. If no new file is found, it sleeps for 5 and then checks again (running for an hour, before dying and then restarting). It also deals with using file descriptors and locks so that only one instance is looking at the directory at a time and it's overly complicated for this.

And then it also does stuff like check all running processes in case a job has gone over an hour long (so don't reuse the user) and all that.

This is obviously something I think can be improved via a long running python script that handles which untrusted ids are available and incoming files. I'm guessing I need some combination of using multiprocessing and watchdog, but I'm not sure how I want to approach this and would love to know if anyone has any advice, since this is extremely new territory for me with python. I guess a simple approach would be to only use multiprocessing so that if an untrusted id is available, check the directory and spin up an instance if a file is there, otherwise sleep for 3 minutes, but that seems less "good" than also using watchdog.

Master_Odin fucked around with this message at 03:45 on Jan 20, 2017

Eela6
May 25, 2007
Shredded Hen
This sounds like a job for coroutines and async/await, rather than multiprocessing. Are you sure you need multiprocessing? This sounds like an IO-bound rather than cpu-bound process, especially since you're hitting the filesystem so often.

If you use async, you can avoid locks entirely. Which is pretty cool! If you're not familiar with async in python 3.6 (and considering it's so new, few are), consider watching Raymond Hettinger's recent talk from the python convention in Russia. The asyncio module in the standard library is what you're looking for.

(asyncio has been available since python 3.4 and the async and await keywords since 3.5; python 3.6 makes it easier and clearer still with asynchronous generators and comprehensions)
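
A rough sketch of what that could look like, assuming Python 3.7+ for asyncio.run. Everything here is hypothetical: the job names, the grade() placeholder, and the idea of keeping the free untrusted users in an asyncio.Queue. Watching the directory for new files is left out, and a real grade() would presumably launch the grading script, for example with asyncio.create_subprocess_exec.

```python
import asyncio

async def grade(job, user, results):
    """Placeholder for the real grading step (hypothetical)."""
    await asyncio.sleep(0.01)  # stands in for running the grader as `user`
    results.append((job, user))

async def worker(jobs, free_users, results):
    while True:
        job = await jobs.get()
        user = await free_users.get()    # wait until an untrusted user is free
        try:
            await grade(job, user, results)
        finally:
            free_users.put_nowait(user)  # hand the user back to the pool
            jobs.task_done()

async def main(job_names, user_names):
    jobs, free_users, results = asyncio.Queue(), asyncio.Queue(), []
    for user in user_names:
        free_users.put_nowait(user)
    for job in job_names:
        jobs.put_nowait(job)
    workers = [asyncio.ensure_future(worker(jobs, free_users, results))
               for _ in user_names]
    await jobs.join()                    # block until every queued job is processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results

results = asyncio.run(main(["hw1", "hw2", "hw3"], ["untrusted00", "untrusted03"]))
print(sorted(job for job, _ in results))
```

Because everything runs on one event loop in one process, handing users in and out of the pool needs no file locks; that only holds as long as a single instance of the script is running.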

Eela6 fucked around with this message at 04:03 on Jan 20, 2017

Tigren
Oct 3, 2003

Eela6 posted:

This sounds like a job for coroutines and async/await, rather than multiprocessing. Are you sure you need multiprocessing? This sounds like an IO-bound rather than cpu-bound process, especially since you're hitting the filesystem so often.

If you use async, you can avoid locks entirely. Which is pretty cool! If you're not familiar with async in python 3.6 (and considering it's so new, few are), consider watching Raymond Hettinger's recent talk from the python convention in Russia. The asyncio module in the standard library is what you're looking for.

(asyncio has been available since python 3.4 and the async and await keywords since 3.5; python 3.6 makes it easier and clearer still with asynchronous generators and comprehensions)

Like one of these!

Tigren fucked around with this message at 04:46 on Jan 20, 2017

Lysidas
Jul 26, 2002

John Diefenbaker is a madman who thinks he's John Diefenbaker.
Pillbug

Eela6 posted:

Hash tables are super cool. Remember that you can create your own __hash__() for classes, if you'd like. It's not usually necessary, because user-defined classes will default to the object id for __hash__(), but it can be useful if you're creating data types that should always hash the same.

As a toy example (inspired by Fluent Python)
code:
class Point3():
    def __init__(self,  x,  y,  z):
        self.x, self.y, self.z = x,  y,  z
    def __hash__(self):
        return hash(self.x) ^ hash(self.y) ^ hash(self.z)
Raymond Hettinger is amazing.

Good example, with one minor tweak/caveat: it's very desirable for a hash function to be uniformly distributed (as much as possible), and a lot of work typically goes into designing hash functions for built-in types to provide good performance when these are used as hash table keys. It's rather easy to accidentally destroy this property by mixing hash functions, and though using bitwise XOR isn't bad, if possible it's a good idea to use built-in functionality for this:

code:
Python 3.6.0 (default, Dec 23 2016, 08:25:24) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> hash(3 ^ 4 ^ 5)
2
>>> hash(tuple([3, 4, 5]))
5050912105994302585

Eela6
May 25, 2007
Shredded Hen
Excellent point. Thanks for the correction!

Lysidas
Jul 26, 2002

John Diefenbaker is a madman who thinks he's John Diefenbaker.
Pillbug
Was going to edit my post, but no point in doing that after a reply. I was going to say that implementing __hash__ as returning hash(tuple(stuff in your object)) also works with non-numeric data:

code:
>>> hash(tuple([4, 'hello', frozenset({9, 10})]))
-375596027188019708
EDIT: guess I'll edit this post too, then. I've seen people combine the hashes of non-numeric things, like

code:
    def __hash__(self):
        return hash(self.obj1) ^ hash(self.obj2)
and this is closer to what I wanted to describe as "don't do this if you can avoid it". It's rather easy to accidentally introduce some pattern in that kind of hash combination, like "hashes are always a multiple of 3" or something similar.
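
Putting the tuple suggestion back into the earlier Point3 toy class might look like the sketch below. The _key helper and the __eq__ method are additions for illustration; defining __eq__ alongside __hash__ is what lets two equal points land on the same set or dict entry.

```python
class Point3:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def _key(self):
        # Hypothetical helper: the fields that define equality.
        return (self.x, self.y, self.z)

    def __eq__(self, other):
        if not isinstance(other, Point3):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Delegate the mixing to the built-in tuple hash instead of XOR-ing by hand.
        return hash(self._key())

# XOR is symmetric, so points with swapped coordinates would always collide.
print(hash(1) ^ hash(2) ^ hash(3) == hash(3) ^ hash(2) ^ hash(1))
print(Point3(1, 2, 3) == Point3(1, 2, 3), len({Point3(1, 2, 3), Point3(1, 2, 3)}))
```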

Lysidas fucked around with this message at 17:21 on Jan 20, 2017

Cingulate
Oct 23, 2012

by Fluffdaddy
This thread is good and I'm learning so many things.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

Lysidas posted:

Was going to edit my post, but no point in doing that after a reply. I was going to say that implementing __hash__ as returning hash(tuple(stuff in your object)) also works with non-numeric data:

code:
>>> hash(tuple([4, 'hello', frozenset({9, 10})]))
-375596027188019708
EDIT: guess I'll edit this post too, then. I've seen people combine the hashes of non-numeric things, like

code:
    def __hash__(self):
        return hash(self.obj1) ^ hash(self.obj2)
and this is closer to what I wanted to describe as "don't do this if you can avoid it". It's rather easy to accidentally introduce some pattern in that kind of hash combination, like "hashes are always a multiple of 3" or something similar.

This Stack Overflow post discusses the use of straightforward arithmetic involving prime numbers in creating a good hashcode implementation (in C#):

http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode

It is made a bit harder perhaps by the fact that Python integers are arbitrary-precision rather than 32-bit. I don't know how hashcodes work internally in Python so I don't know how big an issue this is or how hard it is to get around it.
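
On the arbitrary-precision worry: as far as the built-in hash() goes, whatever __hash__ returns is reduced to a fixed width, which sys.hash_info reports. A tiny check, with the Huge class invented for the purpose:

```python
import sys

class Huge:
    def __hash__(self):
        return 2 ** 100  # far wider than a machine word

width = sys.hash_info.width   # number of bits used for hash values
h = hash(Huge())              # hash() reduces the result to fit that width
print(width, -(2 ** (width - 1)) <= h < 2 ** (width - 1))
```

So prime-multiplication schemes like the C# ones can be written without manual masking, though handing a tuple of the fields to hash() is usually less work.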

creatine
Jan 27, 2012




Are there any exe utilities that have been updated for 3.6?

Dominoes
Sep 20, 2007

Anecdote of the current state of installing Python packages that include C/Fortran code in windows with pip instead of conda:

The following packages work without a problem (i.e. I loaded them in a requirements.txt, then ran 'pip install -r requirements.txt').

numpy
pandas
matplotlib
jupyter
sympy
PyQt5
keras
requests
requests_oauthlib
pytest
toolz
cytoolz
beautifulsoup4
django-toolbelt

The following required Chris Gohlke's site:

scipy
scikit-learn
h5py
tables

Numba installs from pip only after installing llvmlite from CG's site.

Dominoes fucked around with this message at 12:05 on Jan 21, 2017

floppo
Aug 24, 2005
I'm looking to parse text data from three kinds of sources: pdf, .doc and .docx. I have around a thousand of each. I'm surprised at how difficult it has been to extract the text of even a docx file into a Python string (which I'd then write to a plain txt). I'm using Python3 and don't have easy access to Windows to use win32com.

I have googled around and found a lot of solutions that don't seem to work. For instance PyPDF2 only prints blanks when I loop over the pages and print their text content. The python docx library doesn't work in Python3, even after tinkering with it (removing exception imports, adding () to print statements).

I guess my question is if you guys have any good, simple method to turn any of these documents into txts?

Cingulate
Oct 23, 2012

by Fluffdaddy

floppo posted:

I'm looking to parse text data from three kinds of sources: pdf, .doc and .docx. I have around a thousand of each. I'm surprised at how difficult it has been to extract the text of even a docx file into a Python string (which I'd then write to a plain txt). I'm using Python3 and don't have easy access to Windows to use win32com.

I have googled around and found a lot of solutions that don't seem to work. For instance PyPDF2 only prints blanks when I loop over the pages and print their text content. The python docx library doesn't work in Python3, even after tinkering with it (removing exception imports, adding () to print statements).

I guess my question is if you guys have any good, simple method to turn any of these documents into txts?
If you can't find anything else, you can use Pandoc to convert the doc into something you can parse.

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

floppo posted:

I'm looking to parse text data from three kinds of sources: pdf, .doc and .docx. I have around a thousand of each. I'm surprised at how difficult it has been to extract the text of even a docx file into a Python string (which I'd then write to a plain txt).

Have you looked at one in a text editor? They're not exactly trivial to parse

Can't help with suggestions, but make sure your PDFs actually have text content - sometimes they're just page images
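
For the .docx files specifically, one standard-library-only option is to treat the file as what it is: a zip archive with the body in word/document.xml. The sketch below is a bare-bones take on that, with docx_to_text as a made-up helper name. It only joins the text runs of each paragraph, ignores headers, footnotes and the like, and does nothing for .doc or PDF.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_text(file):
    """Return the body text of a .docx (path or file-like object), one paragraph per line."""
    with zipfile.ZipFile(file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        # Text runs live in <w:t> elements; join the runs of each paragraph.
        paragraphs.append("".join(node.text or "" for node in para.iter(W + "t")))
    return "\n".join(paragraphs)

# Tiny hand-made archive standing in for a real document (assumption: real files
# carry more parts, but the body text is in word/document.xml).
sample_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p></w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
buffer.seek(0)
text = docx_to_text(buffer)
print(text)
```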


Dominoes
Sep 20, 2007

pdfminer works well enough, but is slow.
