onionradish
Jul 6, 2006

That's spicy.
You can even drop the list brackets within any(), since most functions that take an iterable also work with a generator expression in 3.x:
code:
if any(x in line for x in allowed_strings):
    ...
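
For example, with some throwaway values:
code:
allowed_strings = ['foo', 'bar']   # placeholder values
line = 'lorem foo ipsum'
if any(x in line for x in allowed_strings):
    print('match')   # prints 'match' since 'foo' appears in line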

onionradish
Jul 6, 2006

That's spicy.
As a Python user on Windows, I'd put Christoph Gohlke in the same "I hope nothing ever happens to him" category. His "Unofficial Windows Binaries for Python" site has been a project lifesaver on more than one occasion.

onionradish
Jul 6, 2006

That's spicy.

laxbro posted:

Newbie question: I'm trying to build a web scraper with Beautiful Soup that pulls table rows off of the roster pages for a variety of college sports teams. I've got it working with regular HTML pages, but it doesn't seem to work with what appear to be Angular pages. Some quick googling makes it seem like I will need to use a Python library like selenium to virtually load the page in order to scrape the HTML tables on the page. Would a workaround be to first use Beautiful Soup, and if a table row returns as None, then call a function to try scraping the page using something like selenium? Or should I just try to scrape all of the pages with selenium or a similar library?
As a suggestion, the way I've done this in the past is to write the parsing based on HTML with Beautiful Soup or lxml so it doesn't matter how the HTML is acquired. For sites that can be scraped without JavaScript/selenium, the HTML can just be fetched with requests; for those that need selenium, the webdriver instance's page_source attribute will have the final JavaScript-manipulated HTML (as long as you wait for or detect that page loading is complete).

I also typically separate the HTML collection from the parsing, saving the HTML to disk so I can fine-tune the parsing without hitting the server. And if the site layout changes in the future, the captured HTML is available so I can update the parsing rules.
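
A bare-bones sketch of that split (the URL and selectors are placeholders; requests, bs4 and selenium are third-party packages):
code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

def fetch_plain(url):
    # plain pages: no JavaScript needed
    return requests.get(url).text

def fetch_rendered(url):
    # JavaScript-heavy pages: let the browser build the DOM first
    driver = webdriver.Firefox()
    driver.get(url)
    html = driver.page_source  # the final JS-manipulated HTML
    driver.quit()
    return html

def parse_roster(html):
    # parsing never cares how the HTML was acquired
    soup = BeautifulSoup(html, 'html.parser')
    return [row.get_text(strip=True) for row in soup.select('table tr')]

html = fetch_plain('http://example.com/roster')  # placeholder URL
with open('roster.html', 'w', encoding='utf-8') as f:
    f.write(html)  # cache to disk so parsing can be tuned offline
rows = parse_roster(html)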

baka kaba posted:

Also Selenium was a real pain last time I used it, they updated something so the Firefox webdriver wasn't included anymore and everything broke. There might be a better scraper out there?

Setting up selenium now involves an extra step. It isn't hard, but it seems to be really poorly documented. Essentially, you need the separate geckodriver executable in addition to selenium, and you either specify the path to that driver when initializing the webdriver or just add it to your PATH. I've always just added it to PATH, but the other way is supposed to be as simple as this:

code:
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'your\path\geckodriver.exe')
driver.get('http://inventwithpython.com')
For selenium, you can optionally set up the Firefox QuickJava extension to selectively disable loading/processing of images, Flash, CSS, Java and other things that slow down scraping. Here's an excerpt from one of my scripts that uses that plugin to initialize the driver and block some of the Firefox "welcome" screens:

code:
class Browser(webdriver.Firefox):
    """
    Wrapper for selenium.webdriver.Firefox browser that adds custom
    methods

    :param use_javascript: True to allow browser to use Javascript
    """

    def __init__(self, use_javascript=False):
        profile = webdriver.FirefoxProfile()
        profile.set_preference("javascript.enabled", use_javascript)

        # disable annoying firefox "welcome/startup" page
        # http://stackoverflow.com/questions/33937067/
        #   firefox-webdriver-opens-first-run-page-all-the-time#34622056
        profile.set_preference("browser.startup.homepage", "about:blank")
        profile.set_preference("startup.homepage_welcome_url", "about:blank")
        profile.set_preference("startup.homepage_welcome_url.additional", "about:blank")

        fx_to_disable = ['Images', 'AnimatedImage', 'CSS', 'Cookies', 
                         'Flash', 'Java', 'Silverlight']
        disable_fx(profile, fx_to_disable)

        super(Browser, self).__init__(profile)

def disable_fx(firefox_profile, fx_to_disable):
    valid_fx = {'Images', 'AnimatedImage', 'CSS', 'Cookies', 'Flash', 'Java', 
                'JavaScript', 'Silverlight', 'Proxy'}
    
    try:
        # QUICKJAVA_XPI_PATH set to filepath of QuickJava XPI file from
        # https://addons.mozilla.org/en-us/firefox/addon/quickjava/
        firefox_profile.add_extension(QUICKJAVA_XPI_PATH)
        # prevents loading the 'thank you for installing' screen
        firefox_profile.set_preference(
            "thatoneguydotnet.QuickJava.curVersion", "2.0.6.1")
    except (NameError, IOError):
        return firefox_profile

    fx_to_disable = [
        'thatoneguydotnet.QuickJava.startupStatus.{}'.format(fx)
        for fx in fx_to_disable if fx in valid_fx
    ]
    
    for fx in fx_to_disable:
        firefox_profile.set_preference(fx, 2)
    
    return firefox_profile
EDIT: To clarify on the QuickJava plugin: I downloaded (or installed and copied) the XPI file and specified its path in QUICKJAVA_XPI_PATH. The version string in disable_fx() will need to match whatever version you use if you want to block the "thanks for installing" screen.

onionradish fucked around with this message at 15:40 on Apr 23, 2017

onionradish
Jul 6, 2006

That's spicy.

underage at the vape shop posted:

So I'm doing some RSS stuff for Uni. How come sometimes the formatting of quotes gets changed and sometimes it doesn't?
Since RSS is XML, any HTML in the entries has to be escaped in order to be included. The first link encodes it using CDATA; the second one just uses HTML entity escaping.

If you're allowed to use third-party libraries, use feedparser anytime you have to deal with RSS. It handles all the complexity of the RSS variants including namespaces and gives you a standardized dictionary.
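
For example, a minimal sketch (the feed URL is a placeholder):
code:
import feedparser  # third-party: pip install feedparser

feed = feedparser.parse('http://example.com/rss.xml')  # placeholder URL
for entry in feed.entries:
    # entries come back in a standardized form regardless of RSS/Atom flavor
    print(entry.title, entry.link)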

If not, you'll need to detect which escaping method has been used and use the stdlib to unescape entries like the second type.

In python 2, it's:
code:
from HTMLParser import HTMLParser
clean = HTMLParser().unescape(html_to_unescape)
In python 3, it's:
code:
import html
clean = html.unescape(html_to_unescape)
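
A rough sketch of the detection, assuming you're working with the raw entry body as a string (the function name is just illustrative):
code:
import html
import re

def extract_html(entry_body):
    """Pull embedded HTML out of a raw RSS entry body (illustrative)."""
    m = re.search(r'<!\[CDATA\[(.*?)\]\]>', entry_body, re.DOTALL)
    if m:
        return m.group(1)  # CDATA: the HTML inside is already literal
    return html.unescape(entry_body)  # otherwise assume entity escaping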

onionradish
Jul 6, 2006

That's spicy.
I'm trying to set up testing for a class that used to be a namedtuple. Previous tests could simply compare equality of the namedtuples:

Python code:
# Testing namedtuples with pytest
def test_data_thing():
    # return a MyCustomThing namedtuple from data
    result = module.make_mycustomthing(data)

    # create namedtuple with expected values for comparison
    expected = module.MyCustomThing(a='a', b='b')

    # compare values of the MyCustomThing namedtuples
    assert result == expected
Now that MyCustomThing is a class, that won't work. The class also now has private attributes that shouldn't be compared.

I've added a method to dump out public attributes as a dict and compare against that, which works, but I'm not sure that's the right solution. Is there a better or 'best practices' way to set up the class or my tests to compare the values that matter? I could write separate tests for each attribute, but that seems worse and would needlessly increase the number of tests.

Python code:
# Testing class attributes with pytest
class MyCustomThing:
    ...
    def as_dict(self):
        """Export public vars as dict for testing"""
        d = dict(vars(self))
        return {k:v for k, v in d.items() if not k.startswith('_')}
    
def test_data_thing():
    # get a MyCustomThing instance from data
    result = module.MyCustomThing(data)

    # create dict with expected values for comparison
    expected = dict(a='a', b='b')

    # compare values of the MyCustomThing instance attributes
    assert result.as_dict() == expected

onionradish
Jul 6, 2006

That's spicy.
I'd considered __eq__, and actually used it during the "upgrade" from the namedtuple to the class. Then, once the class __init__ changed, that broke and I needed to rethink. (I've been looking at the code for too long, so likely not thinking clearly.)

Is something like this a reasonable implementation to enable the comparison of the attributes? It works, but does it "smell"?
Python code:
class MyCustomThing:
    def _public_attribs(self):
        return {k:v for k,v in vars(self).items() if not k.startswith('_')}

    def __eq__(self, other):
        """Test equality with same type or dict"""
        if isinstance(other, self.__class__):
            return self._public_attribs() == other._public_attribs()
        elif isinstance(other, dict):
            return self._public_attribs() == other
        else:
            return False  # MARTHA!!!

onionradish fucked around with this message at 21:55 on Jul 21, 2017

onionradish
Jul 6, 2006

That's spicy.
Ooo .. I like the contract version! Thanks!

EDIT: Why return NotImplemented vs. raise NotImplementedError()?

onionradish fucked around with this message at 22:44 on Jul 21, 2017

onionradish
Jul 6, 2006

That's spicy.

Eela6 posted:

As far as NotImplemented vs NotImplementedError goes, read the docs at https://docs.python.org/3/library/constants.html or Ch. 13 of Fluent Python: Operator Overloading.
I'd never seen that before. I learned something new! Thanks again.

EDIT: a follow up... in your "go crazy" example, you omit the hasattr(other, "_public_attribs") test inside __eq__. Is it not necessary?

onionradish fucked around with this message at 23:24 on Jul 21, 2017

onionradish
Jul 6, 2006

That's spicy.

Eela6 posted:

Check this out.

0. self.__eq__(other) -> self._public_attribs() == other
1. self._public_attribs().__eq__(other) is NotImplemented.
2. Fallback to other.__eq__(self._public_attribs()) -> other._public_attribs() == self._public_attribs().
3. These are both dictionaries, so they compare 'normally'.

Python is cool.
:psyboom:

My melon is officially twisted. I'm leaving the hasattr comparison in for explicitness, but I'm digging how slick that is.
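
For reference, here's roughly where I ended up (a sketch; __init__ and the attribute names are just placeholders):
Python code:
class MyCustomThing:
    def __init__(self, a=None, b=None):
        self.a = a
        self.b = b
        self._private = 'excluded'  # private attrs stay out of comparisons

    def _public_attribs(self):
        return {k: v for k, v in vars(self).items() if not k.startswith('_')}

    def __eq__(self, other):
        """Test equality with same type or dict."""
        if hasattr(other, '_public_attribs'):
            return self._public_attribs() == other._public_attribs()
        if isinstance(other, dict):
            return self._public_attribs() == other
        # NotImplemented (not NotImplementedError!) tells Python to try
        # the reflected comparison, other.__eq__(self), before giving up
        return NotImplemented

# quick checks
thing = MyCustomThing(a='a', b='b')
assert thing == {'a': 'a', 'b': 'b'}
assert thing == MyCustomThing(a='a', b='b')
assert thing != 42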

Eela6 posted:

Python is cool.
:agreed:

onionradish fucked around with this message at 23:59 on Jul 21, 2017

onionradish
Jul 6, 2006

That's spicy.

Seventh Arrow posted:

So my attempts to find a tutor were not very fruitful. So maybe I could get a bit of direction instead.

I'm in a data science / data engineering course, but the level of python expertise required is higher than I initially thought.

I've read "Python Crash Course" and am working my way through "Automate the Boring Stuff with Python" so I know what lists and dictionaries are, etc., but I draw a blank when trying to come up with my own code and analyzing scripts is tricky - like here's an example of a script that I need to modify for a project. I eventually figured it out, but it was pretty challenging.

So anyways, books kinda take longer than I'd like, is there maybe an online course that's fairly decent? It doesn't have to be free. I think I just need to start doing more exercises, but I also still need training wheels somewhat.
I recently ran across PySlackers, a group of ~8,500 Python users of various skill levels who seem to actively help new Python users through Slack.

I'm not a member, so I can't speak to how interactions go, but it might be worth considering since you're likely to find others with a data science background who can answer questions, critique code, or point you toward resources they found useful.

There are some learning resources on their GitHub page, including some video tutorials.

Mod Edit: :iiam:

Somebody fucked around with this message at 22:21 on Sep 7, 2017

onionradish
Jul 6, 2006

That's spicy.
I want to add multi-threading to a basic web scraper I've been tasked with. I have a list of URLs to spread across threads, but I don't want to hit the same host simultaneously.

With a list of URLs, some from the same host, some from different hosts, what's the best way to set up thread Queue()s or some other URL pool so each thread can do simultaneous downloads as long as they're from different hosts?

This seems like something simple, and something that would be in stdlib collections or itertools, but I'm not seeing it. If it's actually a tricky issue, that's fine, and I'll work on a solution -- I just don't want to re-invent the wheel.
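
For what it's worth, the shape I have in mind is one queue and one worker thread per host, something like this sketch (fetch() and the URLs are placeholders):
Python code:
import threading
from collections import defaultdict
from queue import Queue
from urllib.parse import urlparse

def fetch(url):
    print('downloading', url)  # placeholder for the real download

def worker(host_queue):
    while True:
        url = host_queue.get()
        if url is None:  # sentinel: no more URLs for this host
            break
        fetch(url)

urls = ['http://a.example/1', 'http://a.example/2', 'http://b.example/1']

# one queue per host serializes same-host requests while
# different hosts download in parallel
queues = defaultdict(Queue)
for url in urls:
    queues[urlparse(url).netloc].put(url)

threads = []
for q in queues.values():
    q.put(None)  # sentinel
    t = threading.Thread(target=worker, args=(q,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()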

onionradish
Jul 6, 2006

That's spicy.
Speaking of OAuth, is there an easy way, or a recommended module, to let a Windows PC respond to the OAuth callback and receive a token after passing the initial key to an arbitrary service?

I just want to authorize access for a simple personal script on my home PC. The OAuth hassle to set up a public server vs using a simple authentication key is pushing a bunch of "I could automate that" projects to "UGH; maybe some other day..."

onionradish
Jul 6, 2006

That's spicy.
Sorry for the ambiguity of my OAuth question earlier. Specifically, I'm wanting to access the Pocket API from my Windows desktop to pull all my "read or file this later" bookmarks I've dropped in there from my phone while traveling.

The part I'm scratching my head about is Step 2 and onwards of the Pocket authentication process -- getting a request token at the request_uri, since the calling app is a Python script on my home Windows PC and isn't a publicly-accessible address. That request_uri gets used in a couple of places downstream through the process.

One of the Pocket API packages I found on PyPI suggests sending your callback to some random person's website in Germany. Another one does its own redirect to an obfuscated/shortened goo.gl link. Running authorization through either of those sounds like a terrible idea: it hands credentials with complete access to your account to some unknown entity.

I'll look at Zapier; I hadn't considered that as a possibility. If I'm over-complicating this or misunderstanding, I'd greatly appreciate advice. All of the other APIs I've worked with only need a simple "consumer key" passed with the request, so the OAuth stuff is new to me, and I've been putting off learning how to interact with it since it's just home hobby stuff.
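
In case it's useful context, the only workaround I can picture is catching the redirect on localhost with the stdlib. A sketch only; whether Pocket accepts a localhost redirect_uri is an assumption on my part:
Python code:
from http.server import BaseHTTPRequestHandler, HTTPServer

class CallbackHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # self.path carries whatever the service appends to the redirect_uri
        self.server.callback_path = self.path
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'You can close this window now.')

# register http://localhost:8080/ as the redirect_uri with the service
server = HTTPServer(('localhost', 8080), CallbackHandler)
server.handle_request()  # blocks until the single redirect arrives
print('callback received:', server.callback_path)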
