Python information and short questions megathread.


Plasmafountain: Jun 17, 2008

What are people's recommendations for modules to scrape information from a website and then interact with it? I know beautifulsoup is very popular, but there are a lot of others that get mentioned as well.

Application is creating a bot to first scrape tabulated info, then analyse & implement the best moves to make in an online management game.

# ? Oct 16, 2015 21:20

jerry seinfel: Jun 25, 2007

I've been working on a simple script using Selenium, and it's still unfinished. Selenium can parse HTML and interact with webpage elements, though if you're dealing with a webpage from 2003 you might run into issues with frames.

I should probably use Beautiful Soup for my script more, but I'm using Selenium to log in to a page using user-provided credentials, navigate to another page, and then identify the first <tr> that matches a specific color and select the adjacent multiple-select box. From there it reads from specific files based on the content of one of the <td> elements and interacts with drop down menus to select the value from the file. Then it cycles until a specific message appears on the page. Or at least it'll do it when I get around to finishing the script instead of doing the task manually.

I'm still working on it (and a novice) but you can look for elements by ID, xpath, class, and so on. It's pretty handy.

# ? Oct 16, 2015 21:32

fritz: Jul 26, 2003

QuarkJets posted:

Joining on an empty string is kind of pointless, really, because strings are immutable: in Python you can never put more or less stuff into a string, you can only create a new string.

Do you mean

code:

"".join( ... )

? Because I use that all the time on generators and list comprehensions.

# ? Oct 17, 2015 03:25

vikingstrike: Sep 23, 2007; whats happening, captain

I usually use a combination of requests and Beautiful Soup to download and scrape information from websites. Not sure if there is anything newer that has supplanted this. For working with tabular data I use pandas.
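
As a rough illustration of the "scrape tabulated info" step, here is a minimal sketch that pulls the rows out of an HTML table. It deliberately uses only the standard library's html.parser so it runs without installing anything; it is not the requests + Beautiful Soup combination mentioned above, which would make the same extraction shorter. The sample HTML, the class name TableRows and the helper parse_table are all made up for the example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content; a real script would download this first.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Score</th></tr>
  <tr><td>Alice</td><td>12</td></tr>
  <tr><td>Bob</td><td>7</td></tr>
</table>
"""


def parse_table(html):
    parser = TableRows()
    parser.feed(html)
    return parser.rows


rows = parse_table(SAMPLE_HTML)
header, data = rows[0], rows[1:]
```

From there, the header and data rows can be handed to pandas or analysed directly.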

# ? Oct 17, 2015 03:26

QuarkJets: Sep 8, 2008

fritz posted:

Do you mean
code:
"".join( ... )
? Because I use that all the time on generators and list comprehensions.

I meant:

code:

some_string = "foo"
new_string = "".join(some_string)

But only because I thought reversed() returned a string; apparently it returns an iterator
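
For anyone following along, a small sketch of the point being made: reversed() on a string hands back an iterator, so "".join() is what turns it into a string again. The variable names are only for illustration.

```python
some_string = "foo bar"

# reversed() does not build a new string; it returns a lazy iterator.
rev_iter = reversed(some_string)
is_string = isinstance(rev_iter, str)

# Joining on the empty string consumes the iterator and builds the result.
joined = "".join(rev_iter)

# Slicing with a step of -1 gives the same reversed string in one step.
sliced = some_string[::-1]
```

So joining on an empty string is pointless only when the argument is already a string.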

# ? Oct 17, 2015 04:38

nonathlon: Jul 9, 2004; And yet, somehow, now it's my fault ...

Not actually a Python question, more of a reStructuredText question, but: is there some facility in ReST to optionally show / hide content?

Wider context: I'm writing a Xmas pub quiz and naturally thought that it would be easy to do it all in ReST, including answers, and then produce printouts with and without the answers. It seems an obvious use case, and the various slide programs do something similar with notes, but I can't find any prior art.

# ? Oct 19, 2015 13:50

Thots and Prayers: Jul 13, 2006; A is the for the atrocious abominated acts that YOu committed. A is also for ass-i-nine, eight, seven, and six.

B, b, b - b is for your belligerent, bitchy, bottomless state of affairs, but why?

C is for the cantankerous condition of our character, you have no cut-out.; Grimey Drawer

Zero Gravitas posted:

What are people's recommendations for modules to scrape information from a website and then interact with it? I know beautifulsoup is very popular, but there are a lot of others that get mentioned as well.

Application is creating a bot to first scrape tabulated info, then analyse & implement the best moves to make in an online management game.

BSoup bogs down on giant tables (like 20M+) but otherwise works great.

# ? Oct 19, 2015 16:48

Hughmoris: Apr 21, 2007; Let's go to the abyss!

I'm trying to explore and understand pandas and numpy. I have a data set that looks like this:

code:

UNIT        DISCHARGE DATE             DISCHARGE TO      DISCHARGE DELAY(hh:mm:ss)
CARD    10/01/2015 15:10:00    10/01/2015 06:51:42    8:18:18
NEUR    10/01/2015 10:15:00    10/01/2015 07:13:58    3:01:02
SURG    10/01/2015 09:30:00    10/01/2015 07:15:38    2:14:22
CARD    10/01/2015 11:23:00    10/01/2015 07:17:27    4:05:33
CARD    10/01/2015 15:20:00    10/01/2015 07:22:01    7:57:59
NEUR    10/01/2015 14:26:00    10/01/2015 07:23:12    7:02:48
...
...

Is there a simple way to get an average of "DISCHARGE DELAY" for each unit? I'd like to get the average "DISCHARGE DELAY" for each unit, then do a simple graph to display those averages.

# ? Oct 20, 2015 02:39

OnceIWasAnOstrich: Jul 22, 2006

Hughmoris posted:

I'm trying to explore and understand pandas and numpy. I have a data set that looks like this:
code:
UNIT        DISCHARGE DATE             DISCHARGE TO      DISCHARGE DELAY(hh:mm:ss)
CARD    10/01/2015 15:10:00    10/01/2015 06:51:42    8:18:18
NEUR    10/01/2015 10:15:00    10/01/2015 07:13:58    3:01:02
SURG    10/01/2015 09:30:00    10/01/2015 07:15:38    2:14:22
CARD    10/01/2015 11:23:00    10/01/2015 07:17:27    4:05:33
CARD    10/01/2015 15:20:00    10/01/2015 07:22:01    7:57:59
NEUR    10/01/2015 14:26:00    10/01/2015 07:23:12    7:02:48
...
...
Is there a simple way to get an average of "DISCHARGE DELAY" for each unit? I'd like to get the average "DISCHARGE DELAY" for each unit, then do a simple graph to display those averages.

Something like slicing the dataframe with df[df['UNIT'] == unit] and calling .mean() on the delay column, as long as mean() works with times like that.

Also you can use .query() but I don't know how to use .query() because it causes some Exception in the underlying parsing library...I should try to resolve that problem.

# ? Oct 20, 2015 02:47

My Rhythmic Crotch: Jan 13, 2011

In this post I am going to complain about a python module that fails at doing the one thing it's supposed to do: Advanced Python Scheduler.

- instead of storing the schedule (and other config) in a human readable format, it serializes everything to bytes so you cannot read it with your own eyes, or use things like SQL "update" to change the schedule once it has been saved.
- the only way of changing the schedule of a job is to use the reschedule_job() method, however that method always throws an AttributeError, essentially meaning once the schedule has been made, ya can't change it.
- for reasons unknown, the scheduler quit executing jobs, which I could live with, but it never raised an exception, and only logged a warning. The specific message given was "Run time of job <blah> next run at: <blah> was missed by <a few seconds>". The VM could have been starved for resources, or maybe there was an NTP glitch or something, I dunno. There does not seem to be any consensus about what could cause it. I ended up dropping and recreating the jobs, and also modified the job creation code to use a 30 second grace period.

People, I beg you, before you create your applications, spend a bit of time vetting the modules you are going to use. My vetting process is:
- install
- read docs and do "hello world"
- try out a few obvious error cases
- try some of the logic I will need in my app

If I can't get through those steps quickly and without major drama, or I find other issues like those mentioned above, I don't use the module. It's better to vet first and use quality stuff in production rather than have it blow up or stop working right when some manager "NEEDS IT RIGHT NOW".

# ? Oct 20, 2015 02:52

vikingstrike: Sep 23, 2007; whats happening, captain

Hughmoris posted:

I'm trying to explore and understand pandas and numpy. I have a data set that looks like this:
code:
UNIT        DISCHARGE DATE             DISCHARGE TO      DISCHARGE DELAY(hh:mm:ss)
CARD    10/01/2015 15:10:00    10/01/2015 06:51:42    8:18:18
NEUR    10/01/2015 10:15:00    10/01/2015 07:13:58    3:01:02
SURG    10/01/2015 09:30:00    10/01/2015 07:15:38    2:14:22
CARD    10/01/2015 11:23:00    10/01/2015 07:17:27    4:05:33
CARD    10/01/2015 15:20:00    10/01/2015 07:22:01    7:57:59
NEUR    10/01/2015 14:26:00    10/01/2015 07:23:12    7:02:48
...
...
Is there a simple way to get an average of "DISCHARGE DELAY" for each unit? I'd like to get the average "DISCHARGE DELAY" for each unit, then do a simple graph to display those averages.

Phone posting, but get discharge delay into numeric units and then use group by:

df.groupby("unit", sort=True)["discharge_delay"].mean()

This will find the mean discharge delay for each unit in the data. You can then take the series that's returned and easily plot a bar graph with matplotlib or the built-in pandas plotting functions.
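
Spelled out a bit more, here is a minimal sketch of "convert to numeric, then group by". It assumes the delay column holds h:mm:ss strings as in the sample above; the inline CSV, the column name delay_hours and the choice of hours as the unit are assumptions for the example, not part of the original data set.

```python
import io

import pandas as pd

# A cut-down copy of the sample rows, reduced to the two relevant columns.
RAW = """UNIT,DISCHARGE DELAY
CARD,8:18:18
NEUR,3:01:02
SURG,2:14:22
CARD,4:05:33
CARD,7:57:59
NEUR,7:02:48
"""

df = pd.read_csv(io.StringIO(RAW))

# Parse the h:mm:ss strings as timedeltas, then express them as float hours
# so that mean() has plain numbers to work with.
delays = pd.to_timedelta(df["DISCHARGE DELAY"])
df["delay_hours"] = delays.dt.total_seconds() / 3600

# One mean per unit; the result is a Series indexed by UNIT.
mean_delay = df.groupby("UNIT")["delay_hours"].mean()
```

With matplotlib installed, mean_delay.plot(kind="bar") should be enough for the simple graph.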

# ? Oct 20, 2015 03:53

Cingulate: Oct 23, 2012; by Fluffdaddy

Like 50% of posts on the front page of http://stackoverflow.com right now are about Python.

# ? Oct 22, 2015 20:56

Dr Monkeysee: Oct 11, 2002; just a fox like a hundred thousand others; Nap Ghost

My Rhythmic Crotch posted:

In this post I am going to complain about a python module that fails at doing the one thing it's supposed to do: Advanced Python Scheduler.

- the only way of changing the schedule of a job is to use the reschedule_job() method, however that method always throws an AttributeError, essentially meaning once the schedule has been made, ya can't change it.

add_job() with replace_existing=True will allow you to modify the job on the fly, including the schedule. not sure why modify_job() and reschedule_job() don't work this way but whatever!

# ? Oct 22, 2015 22:00

Cingulate: Oct 23, 2012; by Fluffdaddy

Is there a smarter way to achieve this:

code:

def a_function(thing):
    try:
        ... # will take a long time and is expected to fail for some input
        return some_value
    except SomeError:
        return None

d = {k:a_function(v) for k, v in input_dict.items()}
d = {k:v for k, v in d.items() if v != None}

# ? Oct 23, 2015 23:59

Emacs Headroom: Aug 2, 2003

I think that's fine, except you should say 'v is not None' instead of 'v != None'
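
A short sketch of why the identity test is the safer habit: != goes through the object's own comparison methods, which a class is free to override, while "is not" only checks whether the object is the None singleton. The class AlwaysEqual is a contrived, hypothetical example.

```python
class AlwaysEqual:
    """Contrived class whose comparisons claim equality with everything."""

    def __eq__(self, other):
        return True

    def __ne__(self, other):
        return False


obj = AlwaysEqual()

# The overridden __ne__ answers, so this wrongly suggests obj "is" None.
looks_like_none = not (obj != None)  # noqa: E711

# Identity cannot be overridden: obj is clearly not the None singleton.
really_not_none = obj is not None
```

The same issue shows up with numpy arrays, where comparing against None with != is evaluated element by element instead of producing a single True or False.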

# ? Oct 24, 2015 04:32

QuarkJets: Sep 8, 2008

Cingulate posted:

Is there a smarter way to achieve this:

Instead of defining a function, you could define a generator. If the exception is raised then you simply wouldn't yield anything, otherwise you'd yield the (key, value) pair. This would let you define one dictionary instead of two

# ? Oct 24, 2015 05:38

Cingulate: Oct 23, 2012; by Fluffdaddy

QuarkJets posted:

Instead of defining a function, you could define a generator. If the exception is raised then you simply wouldn't yield anything, otherwise you'd yield the (key, value) pair. This would let you define one dictionary instead of two

So, uh ...

code:

def a_function(thing):
    try:
        ... # will take a long time and is expected to fail for some input
        yield some_value
    except SomeError:
        pass

d = {k:a_function(v) for k, v in input_dict.items()}

Would that work? I don't know generators really.
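
A quick way to check is to run a toy version. In this sketch (parse_number and the sample dictionary are stand-ins for the real function and data), calling the generator function does not run its body at all; it just returns a generator object, so every key ends up in the dict with a generator as its value.

```python
import types


def parse_number(text):
    # Because of the yield, calling this returns a generator immediately;
    # the try block only runs once something iterates over it.
    try:
        yield int(text)
    except ValueError:
        pass


input_dict = {"a": "1", "b": "oops", "c": "3"}

d = {k: parse_number(v) for k, v in input_dict.items()}

# Nothing was filtered out and nothing was computed yet.
all_generators = all(isinstance(v, types.GeneratorType) for v in d.values())

# Draining each generator shows the failing input produced no value.
unwrapped = {k: list(gen) for k, gen in d.items()}
```

So the failures are not skipped; they just become empty generators under their keys.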

# ? Oct 24, 2015 12:33

Asymmetrikon: Oct 30, 2009; I believe you're a big dork!

You could do something like:

code:

def map_a_function(things):
    for k, v in things.items():
        try:
            # uses v to produce some_value, may fail
            yield k, some_value
        except SomeError:
            continue

d = dict(map_a_function(input_dict))

# ? Oct 24, 2015 15:46

SurgicalOntologist: Jun 17, 2004

Your loop should go inside the generator.

Personally, I would factor out the exception-skipping from the actual logic of the function.

Python code:

def a_function(thing):
    ...
    return some_value


def dict_apply_skip_exceptions(func, inputs, exception):
    for key, value in inputs.items():
        try:
            yield key, func(value)
        except exception:
            pass


d = dict(dict_apply_skip_exceptions(a_function, input_dict, SomeError))

# ? Oct 24, 2015 15:51

QuarkJets: Sep 8, 2008

SurgicalOntologist posted:

Your loop should go inside the generator.

Personally, I would factor out the exception-skipping from the actual logic of the function.
Python code:
def a_function(thing):
    ...
    return some_value


def dict_apply_skip_exceptions(func, inputs, exception):
    for key, value in inputs.items():
        try:
            yield key, func(value)
        except exception:
            pass


d = dict(dict_apply_skip_exceptions(a_function, input_dict, SomeError))

I like this one. Defining the exception that you catch as a function input kind of makes me frown but I don't really know why

# ? Oct 24, 2015 21:05

Nippashish: Nov 2, 2005; Let me see you dance!

I'd just use the original version. Or probably write out the for loop explicitly. Just because comprehensions exist doesn't mean you need to use them.

# ? Oct 24, 2015 21:09

Cingulate: Oct 23, 2012; by Fluffdaddy

Nippashish posted:

I'd just use the original version. Or probably write out the for loop explicitly. Just because comprehensions exist doesn't mean you need to use them.

I don't know man, my code is mostly comprehensions. I can't remember stuff like np.in1d(np.where or list().index or whatever, it's all comprehensions.

I'm grateful for all the proposals, I'm not sure I can apply any of them directly, but they're all showing me stuff I hadn't thought of that'll be useful otherwise.

Note, I'm actually running the function inside joblib so it may all be a bit more problematic.

# ? Oct 24, 2015 21:14

KernelSlanders: May 27, 2013; Rogue operating systems on occasion spread lies and rumors about me.

What's wrong with the comprehension here? It seems pretty clean.

# ? Oct 24, 2015 22:09

Cingulate: Oct 23, 2012; by Fluffdaddy

It's two, and they are fairly redundant.

# ? Oct 24, 2015 22:36

Emacs Headroom: Aug 2, 2003

Cingulate posted:

It's two, and they are fairly redundant.

Yeah there are two, but until there's a nice equivalent in Python for something like Scala's flatmap, I think what you have is clean and readable. First you map your values to the function outputs, then you filter out the None's.
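
On Python 3.8 or later (newer than anything available when this was posted), an assignment expression lets the map and the filter share one comprehension. This is only a sketch: parse_or_none and the sample input are hypothetical stand-ins for the slow function and its data.

```python
def parse_or_none(text):
    """Stand-in for the slow function: returns None when the input is bad."""
    try:
        return int(text)
    except ValueError:
        return None


input_dict = {"a": "1", "b": "oops", "c": "3", "z": "0"}

# The condition runs first and binds `result`; the key/value expression then
# reuses it, so the function is called exactly once per item.
d = {
    k: result
    for k, v in input_dict.items()
    if (result := parse_or_none(v)) is not None
}
```

Testing with "is not None" rather than plain truthiness keeps legitimate falsy results such as 0.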

# ? Oct 24, 2015 22:44

baka kaba: Jul 19, 2003; PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

You could do something like

Python code:

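def a_function(whatever):
    try:
        ...  # bla bla
        return some_value
    except SomeError:
        return None

# Lazily pair each key with its result, then drop the failed (None) results.
results = ((k, a_function(v)) for k, v in input_dict.items())
filtered_results = ((k, v) for k, v in results if v is not None)
my_dict = dict(filtered_results)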

where you're basically building a pipeline of simple one-line generators that filter the output of previous ones

# ? Oct 25, 2015 00:12

QuarkJets: Sep 8, 2008

KernelSlanders posted:

What's wrong with the comprehension here? It seems pretty clean.

There's nothing wrong with the original pair of comprehensions, it's clean and totally good code. But the thread was asked for alternatives, and that's fun, so we're coming up with alternatives

# ? Oct 25, 2015 01:00

Cingulate: Oct 23, 2012; by Fluffdaddy

Is there a painless way of moving my primary (3.4) anaconda environment to 3.5? I have a huge bunch of conda packages installed, some from binstar.

# ? Oct 26, 2015 01:20

Proteus Jones: Feb 28, 2013

Cingulate posted:

Is there a painless way of moving my primary (3.4) anaconda environment to 3.5? I have a huge bunch of conda packages installed, some from binstar.

won't conda update --all do it?

EDIT: No it won't. Sorry I thought they had released 3.5 and that's what you were asking.

Proteus Jones fucked around with this message at 02:06 on Oct 26, 2015

# ? Oct 26, 2015 02:04

SurgicalOntologist: Jun 17, 2004

conda install python=3.5.* worked for me. But that may have been a bad idea because now conda update --all doesn't work, so I suspect there are some incompatibilities present in the packages I already have installed.

# ? Oct 26, 2015 03:30

Dominoes: Sep 20, 2007

I don't think there's a way to do what you ask. You can create a Python 3.5 virtual environment with this:

code:

conda create -n py35 python=3.5

If Continuum's on schedule, an official 3.5 release should come out within a few days, at which point you won't need to use a virtual environment.

# ? Oct 26, 2015 05:18

Cingulate: Oct 23, 2012; by Fluffdaddy

Oh well.

# ? Oct 26, 2015 09:57

Cingulate: Oct 23, 2012; by Fluffdaddy

Two more beginner questions, should anybody find the time to go through them ...

What's the difference between "&" and "and"? I know I can only use one of them for certain series stuff, but I don't get the principled reason.

I often use list comps to filter. But some terseness is lost when unpacking it to a regular for loop. Is there some magic hack I'm not seeing to make this

code:

for x in sequence if meets_condition(x):
    do a thing
    do another thing

instead of this:

code:

for x in sequence:
    if meets_condition(x):
        do a thing
        do another thing

I know I could loop over [x for x in sequence if meets_condition(x)] instead of over sequence, but that's not a win.

# ? Oct 27, 2015 12:39

KernelSlanders: May 27, 2013; Rogue operating systems on occasion spread lies and rumors about me.

The operator "and" is the builtin logical operator, and it's the one you should probably be using unless you're dealing with numpy arrays.

For the loop, you could pre-filter with a generator or list comprehension, although I don't know that it's any clearer than your second example.

Python code:

for x in (y for y in sequence if meets_condition(y)):
    do a thing
    do another thing

Python code:

filtered_sequence = [x for x in sequence if meets_condition(x)]
for x in filtered_sequence:
    do a thing
    do another thing
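
Two more spellings of the same idea, sketched with a made-up condition (is_even) and a made-up loop body: the builtin filter() does the pre-filtering lazily, and a guard clause with continue keeps the loop body at one indentation level.

```python
def is_even(n):
    return n % 2 == 0


sequence = range(10)

# filter() yields only the items for which the predicate is true.
squares_filter = []
for x in filter(is_even, sequence):
    squares_filter.append(x * x)

# A guard clause skips unwanted items without nesting the rest of the body.
squares_guard = []
for x in sequence:
    if not is_even(x):
        continue
    squares_guard.append(x * x)
```

Neither is shorter than the nested version by much, which is probably why that version is what most code ends up using.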

# ? Oct 27, 2015 14:22

Hammerite: Mar 9, 2007; And you don't remember what I said here, either, but it was pompous and stupid.; Jade Ear Joe

Cingulate posted:

Two more beginner questions, should anybody find the time to go through them ...

What's the difference between "&" and "and"? I know I can only use one of them for certain series stuff, but I don't get the principled reason.

I often use list comps to filter. But some terseness is lost when unpacking it to a regular for loop. Is there some magic hack I'm not seeing to make this
code:
for x in sequence if meets_condition(x):
    do a thing
    do another thing
instead of this:
code:
for x in sequence:
    if meets_condition(x):
        do a thing
        do another thing
I know I could loop over [x for x in sequence if meets_condition(x)] instead of over sequence, but that's not a win.

& is a bitwise operation. It means convert the two operands into sequences of bits, and return the integer value that results from &ing the sequences together. Or at least, for integers that's what it does, and in principle if you implement it for your own classes you don't have to return an int, you can return absolutely anything you want and it doesn't have to even have anything to do with bitwise operations. But at least for integers it's bitwise and.

Type this in and try it for some pairs of integers:

code:

def bitwise_and_demo(a, b):
    c = a & b
    a_str = bin(a)[2:]
    b_str = bin(b)[2:]
    c_str = bin(c)[2:]
    maxlen = max(len(a_str), len(b_str), len(c_str))
    a_str = a_str.zfill(maxlen)
    b_str = b_str.zfill(maxlen)
    c_str = c_str.zfill(maxlen)
    print(a_str)
    print(b_str)
    print(c_str)

Compare | (bitwise or), ^ (bitwise exclusive or), << (left shift), >> (right shift).

"and" and "or" on the other hand are logical operators. You should use & and | when you really mean to do operations on bits (which is a numerical calculation rather than a logical proposition), and you should use "and" and "or" when you really mean to express logical conjunction and disjunction ("I want to add VAT if the shipping address is in Texas AND Venus is rising in Capricorn"). To do otherwise is failing to effectively communicate what your code is meant to be doing.

# ? Oct 27, 2015 14:25
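
To make the distinction above concrete, here is a small sketch. It shows that & combines bits, that "and" short-circuits and returns one of its operands, and that a class can give & its own meaning through __and__ ("and" itself cannot be overloaded). The Flags class and the explode() helper are invented for the example.

```python
# Bitwise: 0b110 & 0b011 keeps only the bits set in both operands.
bitwise = 6 & 3

# Logical: both operands are truthy, so `and` returns the last one as-is.
logical = 6 and 3


def explode():
    raise RuntimeError("never called")


# 0 is falsy, so `and` stops there and the right-hand side is not evaluated.
short_circuit = 0 and explode()


class Flags:
    """Toy container showing that & can be customised via __and__."""

    def __init__(self, names):
        self.names = frozenset(names)

    def __and__(self, other):
        # Here & is defined to mean "flags present in both operands".
        return Flags(self.names & other.names)


common = Flags({"read", "write"}) & Flags({"write", "exec"})
```

That last part is the hook libraries like numpy and pandas use to make & work element by element.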

Cingulate: Oct 23, 2012; by Fluffdaddy

Ah okay thanks guys.

I indeed operate a lot on numpy arrays, or at least stuff that is very similar (pandas series).

# ? Oct 27, 2015 14:55

Emacs Headroom: Aug 2, 2003

I'm reasonably certain that internally Series objects use numpy arrays.

I use bitwise and all the time for indexing on Series. Say you have something stored in a series with a datetime index and you want the average within a time slice:

code:

s = pd.Series(some_numbers, index=some_dates)
s[(s.index >= datetime.datetime(2015, 10, 25)) & (s.index < datetime.datetime(2015, 10, 26))].mean()

will give you the mean of your entries gathered on the 25th of October

# ? Oct 28, 2015 04:05

SurgicalOntologist: Jun 17, 2004

If you're just comparing to the index may as well slice:

Python code:

s[datetime(2015, 10, 25):datetime(2015, 10, 26)].mean()
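
One caveat worth checking before swapping one form for the other: label-based slicing in pandas includes both endpoints, so the slice also picks up an entry stamped exactly at midnight on the 26th, which the >= / < mask leaves out. The sketch below uses made-up timestamps and values, and writes the slice with .loc to make the label-based lookup explicit. It also shows what happens when "and" is used in place of &.

```python
from datetime import datetime

import pandas as pd

# Made-up sample: note the entry at exactly 2015-10-26 00:00.
index = pd.to_datetime([
    "2015-10-24 23:00",
    "2015-10-25 08:00",
    "2015-10-25 20:00",
    "2015-10-26 00:00",
    "2015-10-26 09:00",
])
s = pd.Series([1.0, 2.0, 4.0, 10.0, 20.0], index=index)

start, end = datetime(2015, 10, 25), datetime(2015, 10, 26)

# Boolean mask: half-open interval, midnight on the 26th is excluded.
masked = s[(s.index >= start) & (s.index < end)]

# Label slice: both endpoints are included, so midnight on the 26th is kept.
sliced = s.loc[start:end]

# `and` needs a single True/False from each side, which an array of
# booleans cannot provide, so this raises instead of combining elementwise.
try:
    (s.index >= start) and (s.index < end)
    and_failed = False
except ValueError:
    and_failed = True
```

If no timestamps fall exactly on the boundary, the two forms give the same answer.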

# ? Oct 28, 2015 04:27

Hammerite: Mar 9, 2007; And you don't remember what I said here, either, but it was pompous and stupid.; Jade Ear Joe

Emacs Headroom posted:

I'm reasonably certain that internally Series objects use numpy arrays.

I use bitwise and all the time for indexing on Series. Say you have something stored in a series with a datetime index and you want the average within a time slice:
code:
s = pd.Series(some_numbers, index=some_dates)
s[(s.index >= datetime.datetime(2015, 10, 25)) & (s.index < datetime.datetime(2015, 10, 26))].mean()
will give you the mean of your entries gathered on the 25th of October

This is an example of what I was talking about: these object types, whatever they are, have had the behaviour of & customised to do something the library creator thought would be useful. Now, I don't know anything about this library, but it strikes me as bad. I would prefer to see code that relies less on the reader being intimately familiar with how the & and >= operators are implemented for this library's data types, at the cost of most likely being more verbose.

# ? Oct 28, 2015 11:28

Cingulate: Oct 23, 2012; by Fluffdaddy

Hammerite posted:

This is an example of what I was talking about: these object types, whatever they are, have had the behaviour of & customised to do something the library creator thought would be useful. Now, I don't know anything about this library, but it strikes me as bad. I would prefer to see code that relies less on the reader being intimately familiar with how the & and >= operators are implemented for this library's data types, at the cost of most likely being more verbose.

Arguably, Numpy is so big it'd be bad to not have it work like this.

# ? Oct 28, 2015 14:03
