  • Locked thread
Plasmafountain
Jun 17, 2008

What are people's recommendations for modules to scrape information from and then interact with a website? I know beautifulsoup is very popular but there's a lot of others that get mentioned as well.

Application is creating a bot to first scrape tabulated info, then analyse & implement the best moves to make in an online management game.

jerry seinfel
Jun 25, 2007


I've been working on a simple script using Selenium, and it's still in progress. Selenium can parse HTML and interact with webpage elements, though if you're using a webpage from 2003 you might run into issues with frames.

I should probably use Beautiful Soup for my script more, but I'm using Selenium to log in to a page using user-provided credentials, navigate to another page, and then identify the first <tr> that matches a specific color and select the adjacent multiple-select box. From there it reads from specific files based on the content of one of the <td> elements and interacts with drop down menus to select the value from the file. Then it cycles until a specific message appears on the page. Or at least it'll do it when I get around to finishing the script instead of doing the task manually.

I'm still working on it (and a novice) but you can look for elements by ID, xpath, class, and so on. It's pretty handy.

fritz
Jul 26, 2003

QuarkJets posted:

Joining on an empty string is kind of pointless, really, because strings are immutable: in Python you can never put more or less stuff into a string, you can only create a new string.


Do you mean
code:
"".join( ... )
? Because I use that all the time on generators and list comprehensions.

vikingstrike
Sep 23, 2007

whats happening, captain
I usually use a combination of requests and Beautiful Soup to download and scrape information from websites. Not sure if there is anything newer that has supplanted this. For working with tabular data I use pandas.

QuarkJets
Sep 8, 2008

fritz posted:

Do you mean
code:
"".join( ... )
? Because I use that all the time on generators and list comprehensions.

I meant:

code:
some_string = "foo"
new_string = "".join(some_string)
But only because I thought reversed() returned a string; apparently it returns an iterator
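
For reference, a minimal sketch of the reversed() point, assuming a plain str as input: reversed() hands back an iterator rather than a string, so "".join() is what turns it into a string again. The variable names are only illustrative.

```python
some_string = "foo"

# reversed() on a str returns an iterator, not a new string
rev_iter = reversed(some_string)
is_string = isinstance(rev_iter, str)

# joining the iterator's characters builds the reversed string
joined = "".join(reversed(some_string))

# slicing with a step of -1 gives the same result without join
sliced = some_string[::-1]

print(is_string, joined, sliced)
```

The slice form is usually the shorter way to reverse a string; join is the general tool for any iterable of strings.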

nonathlon
Jul 9, 2004
And yet, somehow, now it's my fault ...
Not actually a Python question, more of a reStructuredText question, but: is there some facility in ReST to optionally show / hide content?

Wider context: I'm writing a Xmas pub quiz and naturally thought that it would be easy to do it all in ReST, including answers, and then produce printouts with and without the answers. Seems an obvious use case, and the various slide programs do something similar with notes, but I can't find any prior art.

Thots and Prayers
Jul 13, 2006

Zero Gravitas posted:

What are people's recommendations for modules to scrape information from and then interact with a website? I know beautifulsoup is very popular but there's a lot of others that get mentioned as well.

Application is creating a bot to first scrape tabulated info, then analyse & implement the best moves to make in an online management game.

BSoup bogs down on giant tables (like 20M+) but otherwise works great.

Hughmoris
Apr 21, 2007
Let's go to the abyss!
I'm trying to explore and understand pandas and numpy. i have a data set that looks like this:
code:
UNIT        DISCHARGE DATE             DISCHARGE TO      DISCHARGE DELAY(hh:mm:ss)
CARD    10/01/2015 15:10:00    10/01/2015 06:51:42    8:18:18
NEUR    10/01/2015 10:15:00    10/01/2015 07:13:58    3:01:02
SURG    10/01/2015 09:30:00    10/01/2015 07:15:38    2:14:22
CARD    10/01/2015 11:23:00    10/01/2015 07:17:27    4:05:33
CARD    10/01/2015 15:20:00    10/01/2015 07:22:01    7:57:59
NEUR    10/01/2015 14:26:00    10/01/2015 07:23:12    7:02:48
...
...
Is there a simple way to get an average of "DISCHARGE DELAY" for each unit? I'd like to get the average "DISCHARGE DELAY" for each unit, then do a simple graph to display those averages.

OnceIWasAnOstrich
Jul 22, 2006

Hughmoris posted:

I'm trying to explore and understand pandas and numpy. i have a data set that looks like this:
code:
UNIT        DISCHARGE DATE             DISCHARGE TO      DISCHARGE DELAY(hh:mm:ss)
CARD    10/01/2015 15:10:00    10/01/2015 06:51:42    8:18:18
NEUR    10/01/2015 10:15:00    10/01/2015 07:13:58    3:01:02
SURG    10/01/2015 09:30:00    10/01/2015 07:15:38    2:14:22
CARD    10/01/2015 11:23:00    10/01/2015 07:17:27    4:05:33
CARD    10/01/2015 15:20:00    10/01/2015 07:22:01    7:57:59
NEUR    10/01/2015 14:26:00    10/01/2015 07:23:12    7:02:48
...
...
Is there a simple way to get an average of "DISCHARGE DELAY" for each unit? I'd like to get the average "DISCHARGE DELAY" for each unit, then do a simple graph to display those averages.

Something like slicing the dataframe per unit, e.g. df[df['UNIT']==unit].mean(), as long as mean() works with times like that.

Also you can use .query(), but I can't tell you how because for me it causes some Exception in the underlying parsing library...I should try to resolve that problem.

My Rhythmic Crotch
Jan 13, 2011

In this post I am going to complain about a python module that fails at doing the one thing it's supposed to do: Advanced Python Scheduler.

- instead of storing the schedule (and other config) in a human readable format, it serializes everything to bytes so you cannot read it with your own eyes, or use things like SQL "update" to change the schedule once it has been saved.
- the only way of changing the schedule of a job is to use the reschedule_job() method, however that method always throws an AttributeError, essentially meaning once the schedule has been made, ya can't change it.
- for reasons unknown, the scheduler quit executing jobs, which I could live with, but it never raised an exception, and only logged a warning. The specific message given was "Run time of job <blah> next run at: <blah> was missed by <a few seconds>". The VM could have been starved for resources, or maybe there was an NTP glitch or something, I dunno. There does not seem to be any consensus about what could cause it. I ended up dropping and recreating the jobs, and also modified the job creation code to use a 30 second grace period.

People, I beg you, before you create your applications, spend a bit of time vetting the modules you are going to use. My vetting process is:
- install
- read docs and do "hello world"
- try out a few obvious error cases
- try some of the logic I will need in my app

If I can't get through those steps quickly and without major drama, or I find other issues like those mentioned above, I don't use the module. It's better to vet first and use quality stuff in production rather than have it blow up or stop working right when some manager "NEEDS IT RIGHT NOW".

vikingstrike
Sep 23, 2007

whats happening, captain

Hughmoris posted:

I'm trying to explore and understand pandas and numpy. i have a data set that looks like this:
code:

UNIT        DISCHARGE DATE             DISCHARGE TO      DISCHARGE DELAY(hh:mm:ss)
CARD    10/01/2015 15:10:00    10/01/2015 06:51:42    8:18:18
NEUR    10/01/2015 10:15:00    10/01/2015 07:13:58    3:01:02
SURG    10/01/2015 09:30:00    10/01/2015 07:15:38    2:14:22
CARD    10/01/2015 11:23:00    10/01/2015 07:17:27    4:05:33
CARD    10/01/2015 15:20:00    10/01/2015 07:22:01    7:57:59
NEUR    10/01/2015 14:26:00    10/01/2015 07:23:12    7:02:48
...
...

Is there a simple way to get an average of "DISCHARGE DELAY" for each unit? I'd like to get the average "DISCHARGE DELAY" for each unit, then do a simple graph to display those averages.

Phone posting, but get discharge delay into numeric units and then use group by:

df.groupby("unit", sort=True)["discharge_delay"].mean()

This will find the mean discharge delay for each unit in the data. You can then take the series that's returned and easily plot a bar graph in matplotlib or with the built-in pandas plotting functions.
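
A minimal sketch of that groupby approach, assuming the delay column arrives as hh:mm:ss text. The shortened column names (unit, discharge_delay, delay_hours) and the inline CSV are assumptions made for the example; the six rows are the ones from the question.

```python
import io

import pandas as pd

# Sample rows copied from the question; column names are shortened (an assumption)
raw = io.StringIO(
    "unit,discharge_delay\n"
    "CARD,8:18:18\n"
    "NEUR,3:01:02\n"
    "SURG,2:14:22\n"
    "CARD,4:05:33\n"
    "CARD,7:57:59\n"
    "NEUR,7:02:48\n"
)
df = pd.read_csv(raw)

# Convert the hh:mm:ss strings to timedeltas, then to plain hours
df["delay_hours"] = pd.to_timedelta(df["discharge_delay"]).dt.total_seconds() / 3600

# One mean per unit, sorted by unit name
mean_delay = df.groupby("unit", sort=True)["delay_hours"].mean()
print(mean_delay)

# mean_delay.plot(kind="bar") would draw the bar graph (needs matplotlib)
```

Converting to hours first keeps the result a plain float series, which plots without any special handling of time types.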

Cingulate
Oct 23, 2012

by Fluffdaddy
Like 50% of posts on the front page of http://stackoverflow.com right now are about Python.

Dr Monkeysee
Oct 11, 2002

just a fox like a hundred thousand others
Nap Ghost

My Rhythmic Crotch posted:

In this post I am going to complain about a python module that fails at doing the one thing it's supposed to do: Advanced Python Scheduler.

- the only way of changing the schedule of a job is to use the reschedule_job() method, however that method always throws an AttributeError, essentially meaning once the schedule has been made, ya can't change it.

add_job() with replace_existing=True will allow you to modify the job on the fly, including the schedule. not sure why modify_job() and reschedule_job() don't work this way but whatever!

Cingulate
Oct 23, 2012

by Fluffdaddy
Is there a smarter way to achieve this:

code:
def a_function(thing):
    try:
        ... # will take a long time and is expected to fail for some input
        return some_value
    except SomeError:
        return None

d = {k:a_function(v) for k, v in input_dict.items()}
d = {k:v for k, v in d.items() if v != None}

Emacs Headroom
Aug 2, 2003
I think that's fine, except you should say 'v is not None' instead of 'v != None'
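
A short sketch of why the identity check is the safer habit: != asks the object itself, and a class can answer however it likes, while "is not" only checks identity. AlwaysEqual is a made-up class for illustration.

```python
class AlwaysEqual:
    """Hypothetical class whose comparisons claim equality with everything."""

    def __eq__(self, other):
        return True

    def __ne__(self, other):
        return False


v = AlwaysEqual()

# != calls v.__ne__(None), which here reports "not different from None"
by_equality = v != None  # noqa: E711

# "is not" compares identity and cannot be overridden
by_identity = v is not None

print(by_equality, by_identity)
```

With numpy arrays the same comparison against None is evaluated element by element, which is another reason to prefer "is not None" in a filter.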

QuarkJets
Sep 8, 2008

Cingulate posted:

Is there a smarter way to achieve this:

Instead of defining a function, you could define a generator. If the exception is raised then you simply wouldn't yield anything, otherwise you'd yield the (key, value) pair. This would let you define one dictionary instead of two

Cingulate
Oct 23, 2012

by Fluffdaddy

QuarkJets posted:

Instead of defining a function, you could define a generator. If the exception is raised then you simply wouldn't yield anything, otherwise you'd yield the (key, value) pair. This would let you define one dictionary instead of two
So, uh ...

code:
def a_function(thing):
    try:
        ... # will take a long time and is expected to fail for some input
        yield some_value
    except SomeError:
        pass

d = {k:a_function(v) for k, v in input_dict.items()}
Would that work? I don't know generators really.

Asymmetrikon
Oct 30, 2009

I believe you're a big dork!
You could do something like:

code:
def map_a_function(things):
    for k, v in things.items():
        try:
            # uses v to produce some_value, may fail
            yield k, some_value
        except SomeError:
            continue

d = dict(map_a_function(input_dict))

SurgicalOntologist
Jun 17, 2004

Your loop should go inside the generator.

Personally, I would factor out the exception-skipping from the actual logic of the function.

Python code:
def a_function(thing):
    ...
    return some_value


def dict_apply_skip_exceptions(func, inputs, exception):
    for key, value in inputs.items():
        try:
            yield key, func(value)
        except exception:
            pass


d = dict(dict_apply_skip_exceptions(a_function, input_dict, SomeError))
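
A runnable sketch of the same pattern with concrete stand-ins: parse_int plays the role of a_function and ValueError the role of SomeError (both are assumptions for the example, as is the sample input).

```python
def parse_int(text):
    # stand-in for the slow function; raises ValueError on bad input
    return int(text)


def dict_apply_skip_exceptions(func, inputs, exception):
    # yield (key, result) pairs, silently skipping inputs that raise
    for key, value in inputs.items():
        try:
            yield key, func(value)
        except exception:
            pass


input_dict = {"a": "1", "b": "oops", "c": "3"}
d = dict(dict_apply_skip_exceptions(parse_int, input_dict, ValueError))
print(d)
```

Because the failing key never reaches dict(), there is no second filtering pass, and a legitimate None result would survive.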

QuarkJets
Sep 8, 2008

SurgicalOntologist posted:

Your loop should go inside the generator.

Personally, I would factor out the exception-skipping from the actual logic of the function.

Python code:
def a_function(thing):
    ...
    return some_value


def dict_apply_skip_exceptions(func, inputs, exception):
    for key, value in inputs.items():
        try:
            yield key, func(value)
        except exception:
            pass


d = dict(dict_apply_skip_exceptions(a_function, input_dict, SomeError))

I like this one. Defining the exception that you catch as a function input kind of makes me frown but I don't really know why

Nippashish
Nov 2, 2005

Let me see you dance!
I'd just use the original version. Or probably write out the for loop explicitly. Just because comprehensions exist doesn't mean you need to use them.

Cingulate
Oct 23, 2012

by Fluffdaddy

Nippashish posted:

I'd just use the original version. Or probably write out the for loop explicitly. Just because comprehensions exist doesn't mean you need to use them.
I don't know man, my code is mostly comprehensions. I can't remember stuff like np.in1d(np.where or list().index or whatever, it's all comprehensions.

I'm grateful for all the proposals, I'm not sure I can apply any of them directly, but they're all showing me stuff I hadn't thought of that'll be useful otherwise.

Note, I'm actually running the function inside joblib so it may all be a bit more problematic.

KernelSlanders
May 27, 2013

Rogue operating systems on occasion spread lies and rumors about me.
What's wrong with the comprehension here? It seems pretty clean.

Cingulate
Oct 23, 2012

by Fluffdaddy
It's two, and they are fairly redundant.

Emacs Headroom
Aug 2, 2003

Cingulate posted:

It's two, and they are fairly redundant.

Yeah there are two, but until there's a nice equivalent in Python for something like Scala's flatMap, I think what you have is clean and readable. First you map your values to the function outputs, then you filter out the Nones.

baka kaba
Jul 19, 2003

You could do something like

Python code:
def a_function(whatever):
    try:
        ...  # bla bla
        return some_value
    except SomeError:
        return None

results = ((k, a_function(v)) for k, v in input_dict.items())
# "is not None" keeps legitimate falsy results such as 0 or ""
filtered_results = ((k, v) for k, v in results if v is not None)
my_dict = dict(filtered_results)
where you're basically building a pipeline of simple one-line generators that filter the output of previous ones

QuarkJets
Sep 8, 2008

KernelSlanders posted:

What's wrong with the comprehension here? It seems pretty clean.

There's nothing wrong with the original pair of comprehensions, it's clean and totally good code. But the thread was asked for alternatives, and that's fun, so we're coming up with alternatives.

Cingulate
Oct 23, 2012

by Fluffdaddy
Is there a painless way of moving my primary (3.4) anaconda environment to 3.5? I have a huge bunch of conda packages installed, some from binstar.

Proteus Jones
Feb 28, 2013



Cingulate posted:

Is there a painless way of moving my primary (3.4) anaconda environment to 3.5? I have a huge bunch of conda packages installed, some from binstar.

won't conda update --all do it?

EDIT: No it won't. Sorry I thought they had released 3.5 and that's what you were asking.

Proteus Jones fucked around with this message at 02:06 on Oct 26, 2015

SurgicalOntologist
Jun 17, 2004

conda install python=3.5.* worked for me. But that may have been a bad idea because now conda update --all doesn't work, so I suspect there are some incompatibilities present in the packages I already have installed.

Dominoes
Sep 20, 2007

I don't think there's a way to do what you ask. You can create a Python 3.5 virtual environment with this:
code:
conda create -n py35 python=3.5
If Continuum's on schedule, an official 3.5 release should come out within a few days, at which point you won't need to use a virtual environment.

Cingulate
Oct 23, 2012

by Fluffdaddy
Oh well.

Cingulate
Oct 23, 2012

by Fluffdaddy
Two more beginner questions, should anybody find the time to go through them ...

What's the difference between "&" and "and"? I know I can only use one of them for certain series stuff, but I don't get the principled reason.

I often use list comps to filter. But some terseness is lost when unpacking it to a regular for loop. Is there some magic hack I'm not seeing to make this

code:
for x in sequence if meets_condition(x):
    do a thing
    do another thing
instead of this:

code:
for x in sequence:
    if meets_condition(x):
        do a thing
        do another thing
I know I could loop over [x for x in sequence if meets_condition(x)] instead of over sequence, but that's not a win.

KernelSlanders
May 27, 2013

Rogue operating systems on occasion spread lies and rumors about me.
The "and" operator is the built-in logical operator, and it's the one you should probably be using unless you're dealing with numpy arrays.

For the loop, you could pre-filter with a generator or list comprehension, although I don't know that it's any clearer than your second example.

Python code:
for x in (y for y in sequence if meets_condition(y)):
    do a thing
    do another thing
or

Python code:
filtered_sequence = [x for x in sequence if meets_condition(x)]
for x in filtered_sequence:
    do a thing
    do another thing
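
Another option, sketched with made-up names: the built-in filter() keeps the loop flat without an inner comprehension. meets_condition and sequence are placeholders here, and the even-number test is just an example predicate.

```python
def meets_condition(x):
    # placeholder predicate: keep even numbers
    return x % 2 == 0


sequence = range(10)
seen = []

# filter() lazily yields only the items for which the predicate is true
for x in filter(meets_condition, sequence):
    seen.append(x)

print(seen)
```

Whether this reads better than the nested if is a matter of taste; it mostly pays off when the predicate already exists as a named function.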

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

Cingulate posted:

Two more beginner questions, should anybody find the time to go through them ...

What's the difference between "&" and "and"? I know I can only use one of them for certain series stuff, but I don't get the principled reason.

I often use list comps to filter. But some terseness is lost when unpacking it to a regular for loop. Is there some magic hack I'm not seeing to make this

code:
for x in sequence if meets_condition(x):
    do a thing
    do another thing
instead of this:

code:
for x in sequence:
    if meets_condition(x):
        do a thing
        do another thing
I know I could loop over [x for x in sequence if meets_condition(x)] instead of over sequence, but that's not a win.

& is a bitwise operation. It means convert the two operands into sequences of bits, and return the integer value that results from &ing the sequences together. Or at least, for integers that's what it does, and in principle if you implement it for your own classes you don't have to return an int, you can return absolutely anything you want and it doesn't have to even have anything to do with bitwise operations. But at least for integers it's bitwise and.

Type this in and try it for some pairs of integers:

code:
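def bitwise_and_demo(a, b):
    """Print a, b and a & b as zero-padded binary strings, one per line."""
    c = a & b
    # bin() returns e.g. '0b101'; [2:] strips the '0b' prefix
    a_str = bin(a)[2:]
    b_str = bin(b)[2:]
    c_str = bin(c)[2:]
    # pad all three to the same width so the bits line up
    maxlen = max(len(a_str), len(b_str), len(c_str))
    a_str = a_str.zfill(maxlen)
    b_str = b_str.zfill(maxlen)
    c_str = c_str.zfill(maxlen)
    print(a_str)
    print(b_str)
    print(c_str)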
Compare | (bitwise or), ^ (bitwise exclusive or), << (left shift), >> (right shift).
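
To see the operators side by side, a small sketch using the integers 6 and 3, plus a hypothetical Flags class that gives & its own meaning through __and__. The class is invented for illustration and is not taken from any library.

```python
a, b = 6, 3  # 0b110 and 0b011

bit_and = a & b    # only bits set in both operands
bit_or = a | b     # bits set in either operand
bit_xor = a ^ b    # bits set in exactly one operand
shifted_left = a << 1
shifted_right = a >> 1

# "and" is logical: it returns the second operand when the first is truthy
logical = a and b

# "and" short-circuits, so the division is never evaluated here
short_circuit = 0 and (1 / 0)


class Flags:
    """Hypothetical class that defines what & means for its instances."""

    def __init__(self, names):
        self.names = set(names)

    def __and__(self, other):
        # return the flags common to both operands
        return Flags(self.names & other.names)


common = Flags(["read", "write"]) & Flags(["write", "exec"])
print(bit_and, bit_or, bit_xor, shifted_left, shifted_right, logical, short_circuit)
print(sorted(common.names))
```

Python lets a class define & through __and__, but "and" cannot be overloaded: it only ever asks each operand for its truth value.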

"and" and "or" on the other hand are logical operators. You should use & and | when you really mean to do operations on bits (which is a numerical calculation rather than a logical proposition), and you should use "and" and "or" when you really mean to express logical conjunction and disjunction ("I want to add VAT if the shipping address is in Texas AND Venus is rising in Capricorn"). To do otherwise is failing to effectively communicate what your code is meant to be doing.

Cingulate
Oct 23, 2012

by Fluffdaddy
Ah okay thanks guys.

I indeed operate a lot on numpy arrays, or at least stuff that is very similar (pandas series).

Emacs Headroom
Aug 2, 2003
I'm reasonably certain that internally Series objects use numpy arrays.

I use bitwise "and" all the time when indexing Series. Say you have something stored in a Series with a datetime index and you want the average within a time slice:

code:
s = pd.Series(some_numbers, index=some_dates)
s[(s.index >= datetime.datetime(2015, 10, 25)) & (s.index < datetime.datetime(2015, 10, 26))].mean()
will give you the mean of your entries gathered on the 25th of October
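
A self-contained sketch of that mask, with made-up dates and numbers, which also shows why "and" is not an option here: each comparison produces a whole array of booleans, and "and" needs a single True or False.

```python
import datetime

import pandas as pd

# Made-up readings around the day of interest
some_dates = [
    datetime.datetime(2015, 10, 24, 12),
    datetime.datetime(2015, 10, 25, 6),
    datetime.datetime(2015, 10, 25, 18),
    datetime.datetime(2015, 10, 26, 0),
]
some_numbers = [1.0, 2.0, 4.0, 8.0]
s = pd.Series(some_numbers, index=some_dates)

start = datetime.datetime(2015, 10, 25)
end = datetime.datetime(2015, 10, 26)

# Each comparison yields a boolean array; & combines them element by element
mask = (s.index >= start) & (s.index < end)
mean_on_25th = s[mask].mean()

# "and" would need one truth value per array, which is ambiguous
try:
    (s.index >= start) and (s.index < end)
    and_failed = False
except ValueError:
    and_failed = True

print(mean_on_25th, and_failed)
```

The parentheses around each comparison matter, because & binds more tightly than >= and <.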

SurgicalOntologist
Jun 17, 2004

If you're just comparing to the index may as well slice:

Python code:
s[datetime(2015, 10, 25):datetime(2015, 10, 26)].mean()
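
One thing worth checking before swapping the mask for a slice: label slices on a sorted datetime index include both endpoints, so an entry stamped exactly at midnight on the 26th is kept by the slice but dropped by the strict < comparison. A small sketch with made-up data:

```python
from datetime import datetime

import pandas as pd

s = pd.Series(
    [2.0, 4.0, 8.0],
    index=[
        datetime(2015, 10, 25, 6),
        datetime(2015, 10, 25, 18),
        datetime(2015, 10, 26, 0),
    ],
)

# Label slicing keeps both endpoints, including midnight on the 26th
sliced = s[datetime(2015, 10, 25):datetime(2015, 10, 26)]

# The boolean mask with a strict < leaves the midnight entry out
masked = s[(s.index >= datetime(2015, 10, 25)) & (s.index < datetime(2015, 10, 26))]

print(len(sliced), len(masked))
```

If no timestamp falls exactly on the end boundary, the two forms select the same rows.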

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

Emacs Headroom posted:

I'm reasonably certain that internally Series objects use numpy arrays.

I use bitwise "and" all the time when indexing Series. Say you have something stored in a Series with a datetime index and you want the average within a time slice:

code:
s = pd.Series(some_numbers, index=some_dates)
s[(s.index >= datetime.datetime(2015, 10, 25)) & (s.index < datetime.datetime(2015, 10, 26))].mean()
will give you the mean of your entries gathered on the 25th of October

This is an example of what I was talking about: whatever these object types are, the behaviour of & has been customised to do something the library creator thought would be useful. Now, I don't know anything about this library, but it strikes me as bad. I would prefer to see code that relies less on the reader being intimately familiar with how the & and >= operators are implemented for this library's data types, even at the cost of most likely being more verbose.

Cingulate
Oct 23, 2012

by Fluffdaddy

Hammerite posted:

This is an example of what I was talking about: whatever these object types are, the behaviour of & has been customised to do something the library creator thought would be useful. Now, I don't know anything about this library, but it strikes me as bad. I would prefer to see code that relies less on the reader being intimately familiar with how the & and >= operators are implemented for this library's data types, even at the cost of most likely being more verbose.
Arguably, Numpy is so big it'd be bad to not have it work like this.
