baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

Selenium has a subset of BS's functionality, but it also literally runs in a live browser, so it can work with pages that require javascript to run, or need to be interacted with to load the stuff you want, or otherwise have some awkward workflow. It's also able to do actual automation for testing and the like, as well as basic scraping

BS is more powerful and expressive (I think, been a while since I used Selenium) but it does rely on you getting the HTML you need to scrape. It has some basic networking stuff, or you can use something like Requests for more power, but some sites just don't play nice, and it can be a lot less hassle to have Selenium run an actual full browser to get the page loaded
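As a concrete baseline, scraping static HTML with BS is only a few lines - here with the HTML inlined instead of fetched, purely for illustration:

```python
from bs4 import BeautifulSoup

# A tiny HTML snippet standing in for a fetched page - in real use you'd
# get this from requests.get(url).text
html = '<html><body><a href="/one">1</a> <a href="/two">2</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns Tag objects; pull out the attribute you care about
links = [a['href'] for a in soup.find_all('a')]
```

If the page only gets its content via JavaScript, this approach sees nothing, which is where Selenium comes in.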


SurgicalOntologist
Jun 17, 2004

An often overlooked alternative is to investigate the site's requests using dev tools (particularly the network tab). Sometimes there's a url you can hit for JSON data and not have to parse any HTML. Especially if you find yourself needing JavaScript to get the data--it has to come from somewhere.
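For example, if dev tools showed the page hitting a (hypothetical) JSON endpoint, the response is just a dict away - no HTML parsing at all. Simulating the payload here rather than making a real request:

```python
import json

# Suppose the network tab showed the page fetching a URL like
# https://example.com/api/items?page=1 (hypothetical). With requests you'd
# do data = requests.get(api_url).json(); the payload might look like this:
payload = '{"items": [{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}]}'
data = json.loads(payload)

# now it's plain dicts and lists - no soup required
names = [item["name"] for item in data["items"]]
```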

QuarkJets
Sep 8, 2008

Python classes are cool cause everything in Python is cool, oh you want to add another attribute to this object sure whatever you want bub there you go

what you want to use setters/getters but want it to look like you're not just accessing the attribute directly, okay whatever you say boss
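That's what @property is for - a sketch with a made-up class, where the getter/setter run but the call sites look like plain attribute access:

```python
class Celsius:
    def __init__(self, degrees):
        self._degrees = degrees

    @property
    def degrees(self):
        # getter - runs when you read c.degrees
        return self._degrees

    @degrees.setter
    def degrees(self, value):
        # setter - runs when you assign c.degrees = ...
        if value < -273.15:
            raise ValueError("below absolute zero")
        self._degrees = value

c = Celsius(20)
c.degrees = 25   # runs the setter, but looks like a plain assignment
```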

keyframe
Sep 15, 2007

I have seen things

baka kaba posted:

Selenium has a subset of BS's functionality, but it also literally runs in a live browser, so it can work with pages that require javascript to run, or need to be interacted with to load the stuff you want, or otherwise have some awkward workflow. It's also able to do actual automation for testing and the like, as well as basic scraping

:words:

cheers guys.

Sounds like it's best to learn both.

Gothmog1065
May 14, 2009

baka kaba posted:

If it's not clear, Python's a dynamic language where you can just assign attributes and functions to objects whenever you like. So you can take any thing and go bitmap.butts = 101 or whatever you like - under the hood there's a local namespace with a dictionary of attributes and functions, and you can add and remove from that however you want

So when you define a class, you can set attributes on the class object itself - you're adding to the class's dictionary, so any instances can see that stuff in a higher scope. So that works as a class variable - instances can reference it, and you can just reference it as an attribute on the class itself if you like, MyClass.x

What you're doing in the __init__ constructor is taking the instance itself (passed in as a parameter called self by convention) and just assigning attributes to that object. So when you do self.x = 69 you're adding that attribute to that object's local dictionary, which only affects that instance. So that basically works as an instance variable. There's nothing special about it - you're just adding that property to that object that was passed in

The maybe weird thing is that you define all these functions in the class as def whatever(self, x, y), but when you call the method on an instance, you just do thing.whatever(x, y). Under the hood, it rewrites the call and does MyClass.whatever(thing, x, y) - it calls the function in the class object, and passes in the instance object, so you can mess with it in the body of the function. That's why they all have that self parameter - so they can reference and affect the actual instance. The language sugar takes care of rewriting those calls for you, but in the function itself you have to work with the instance parameter explicitly

So just to get my puny little mind over the semantics:

code:
class Class:
    def __init__(self, name):
        self.name = name
    name = "Whee"

test = Class("Foo")
print(test.name)
print(Class.name)

test2 = Class("Bar")

>>>
Foo
Whee
So basically when looking at those print statements, the first print statement is looking at Class.test.name and the second is looking at Class.name. So to directly call the class variable 'name' you would have to Class.name.

I'm manually converting and rewriting a function based script to a class based script, so the main reason for the questions is so when I'm rewriting, I needed to make sure that I knew that within a class, name != self.name.

QuarkJets
Sep 8, 2008

I think it's more accurate to say that the first is printing test.name, where test is an instance rather than a class.

And if you didn't define self.name, then test.name would refer to the class variable

Foxfire_
Nov 8, 2010

You can look at the dictionaries themselves and see where the entries are:

code:
class Class:
    """ I'm a docstring! """
    def __init__(self, name):
        self.name = name
    name = "Whee"

test = Class("Foo")
test2 = Class("Bar")

print("Class's dict:")
for key, value in Class.__dict__.items():
    print("{0:>15}: {1}".format(key, value))
print()
print()
print("test's dict:")
for key, value in test.__dict__.items():
    print("{0:>15}: {1}".format(key,value))
print()
print()
print("test2's dict:")
for key, value in test2.__dict__.items():
    print("{0:>15}: {1}".format(key,value))
code:
Class's dict:
     __module__: __main__
    __weakref__: <attribute '__weakref__' of 'Class' objects>
       __dict__: <attribute '__dict__' of 'Class' objects>
       __init__: <function Class.__init__ at 0x7f76739a9158>
           name: Whee
        __doc__:  I'm a docstring! 


test's dict:
           name: Foo


test2's dict:
           name: Bar
When you access thing.whatever, it searches for an entry for "whatever" in thing.__dict__, then searches thing.__class__.__dict__ if it didn't find anything*


*this is a little bit of a lie. There are some other names that live somewhere else. That's why you don't see an entry for "__class__" in test and test2's dicts even though you can access "test.__class__" without getting an error. And there are a few more places that get searched for base classes.
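That lookup order (instance dict, then the class's dict, then the base classes) is easy to poke at with a tiny hierarchy - a sketch, the names here are made up:

```python
class Base:
    x = "from Base"

class Child(Base):
    pass

c = Child()

# nothing named x in the instance's or Child's own dict...
found_locally = "x" in c.__dict__ or "x" in Child.__dict__

# ...but attribute lookup still finds it, by walking up to Base
value = c.x

c.x = "from instance"   # writes into c.__dict__ only
shadowed = c.x          # now found immediately in the instance dict
```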

Gothmog1065
May 14, 2009

QuarkJets posted:

I think it's more accurate to say that the first is printing test.name, where test is an instance rather than a class.

And if you didn't define self.name, then test.name would refer to the class variable

So test.name will look for the 'test' instance of the name first, and if it doesn't exist, then fall back to the class variable?

Also, thanks for answering all these nitpicky questions, this really helps me grasp class variable inheritances much more cleanly. This isn't directly related to my code, but may be of help in the future, and greatly helps my understanding of classes in general.

CarForumPoster
Jun 26, 2013

⚡POWER⚡
TLDR: I get ValueError: ('cannot index with vector containing NA / NaN values', 'occurred at index 0') but I don't think I have any NA values.

I have code that looks up a string from df["StatCat"] in ArrestTable and returns whether it's a misdemeanor, felony, or other in the column "Lev". It will eventually do this for several other columns as well. Because the records aren't always exact in writing the statute number that's in StatCat, it needs to be a fuzzy logic match.

code:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def find_law(row):
    abbrev = process.extractOne(df_raw["StatCat"][row],choices=ArrestTable["StatuteCat"],score_cutoff=80)
    if abbrev:
        #print(df_raw["StatCat"][row])
        #print(abbrev[0])
        return ArrestTable[ArrestTable['StatuteCat'] == abbrev[0]]["Lev"].item()
    return np.nan
I do some regex/string manipulation to get them into a common format for the lookup.
code:
df_raw[['StatuteNum','StatSub']] = df_raw["Statute"].str.split(r'\(|\)', expand=True, n=1).iloc[:,[0,1]]
df_raw["StatSub"] = df_raw["StatSub"].str.replace("(", "", regex=False)
df_raw["StatSub"] = df_raw["StatSub"].str.replace(")", "", regex=False)
df_raw["StatSub"] = df_raw["StatSub"].str.lower()
df_raw["StatCat"] = df_raw["StatuteNum"].map(str) + "&" + df_raw["StatSub"] 
I tried fillna so that there wouldn't be any NA values...
code:
df_raw["StatSub"] = df_raw["StatSub"].fillna("")
df_raw["StatuteNum"] = df_raw["StatuteNum"].fillna("")
df_raw["StatCat"] = df_raw["StatCat"].fillna("")

#df_raw["StatLev"] = df_raw["StatCat"].map(ArrestTable.set_index('StatuteCat')["Lev"]) # This doesn't work: due to minor text differences in the statute number, it returns NA ~25% of the time.
df_raw["StatLev"] = df_raw.apply(find_law, axis=1)

df_raw[180:190].T
I get a ValueError:
code:

ValueError: ('cannot index with vector containing NA / NaN values', 'occurred at index 0')

---> 12 df_raw["StatLev"] = df_raw.apply(find_law, axis=1)
[...]
      4 def find_law(row):
----> 5     abbrev = process.extractOne(df_raw["StatCat"][row],choices=ArrestTable["StatuteCat"],score_cutoff=80)
EDIT: This works fine, though is slow:
code:
for i in range(20):
    tempvar = find_law(i)
    print(tempvar)

CarForumPoster fucked around with this message at 15:34 on Dec 12, 2018

baka kaba
Jul 19, 2003


Gothmog1065 posted:

So test.name will look for the 'test' instance of the name first, and if it doesn't exist, then fall back to the class variable?

Also, thanks for answering all these nitpicky questions, this really helps me grasp class variable inheritances much more cleanly. This isn't directly related to my code, but may be of help in the future, and greatly helps my understanding of classes in general.

I dunno if you have any experience with other languages, but defining a variable on a class is basically setting an attribute that's inherent to all instances of that class. So if you change the value of that attribute, all instances are affected because it's a shared 'trait' if you like

Python code:
class Dog:
    is_good = True

    def __init__(self, name):
        self.name = name
so here, we define a Dog and set this is_good attribute on the Dog class itself. Every instance of Dog you create will see that attribute. All dogs are good! And if you set that to False, then all dogs will be bad, you monster

But in __init__ we're setting a name variable, on the instance itself. Like how you can define local variables inside a function, that are only visible in that scope, you're setting a value on the instance's local dictionary of variables. So you can create two dogs with different names, and that name assignment will happen locally for each of them - and you can change the names, or even delete them, and it'll only happen to the instance you're doing that to. The two instances are both Dogs, but you're not changing a shared variable in the Dog class, you're messing with their local, independent attributes in each instance

The whole "local scope in a function" thing is really how it all works, and that's what Foxfire_ is getting at. The class variables are like a higher, 'global' scope, but the instance objects have their own local scope too. If you try to read dog_instance.whatever it'll first check the local scope in dog_instance to see if that variable is defined, and if not it'll go up the chain to the Dog class to see if that has a variable with that name. (And if it doesn't, and the class inherits from another class, it can go up the chain to see if it's defined anywhere in the hierarchy)

When you assign a value, you're writing it to the local dictionary of whatever you're doing the assignment on. If you do it to the instance, you're creating a local value, so when you try and read it from the instance you'll immediately find that local value and get that back - it won't need to go looking in the class or any of its parents. This means you can shadow a variable in a higher scope - basically overriding it with another variable with the exact same name that will get read instead. You're not changing that attribute in the higher scope, you're creating a new local one that will take precedence (and in some situations your IDE will warn you about this)

Python code:
>>> wishbone = Dog('Wishbone')
>>> rude_dog = Dog('Rude Dog')
>>> rude_dog.is_good = False
>>> rude_dog.is_good
False
>>> wishbone.is_good
True
it might look like I'm setting the class variable is_good to false, but I'm not - because that variable is being set on the rude_dog object, it's creating a new local variable, which is what's being read. But I haven't set one on wishbone, so that looks up the hierarchy to the class, which does have a variable with that name, so that's what gets returned.

If you want to change the value in Dog, you gotta do it explicitly in the same way
Python code:
>>> Dog.is_good = False
>>> wishbone.is_good
False
>>> rude_dog.is_good
False
but again, rude_dog has its own variable set that happens to be called is_good, which is what's being referenced there. They both happen to be set to False now but they're completely independent variables that happen to share the same name. So you need to be careful if you're doing this sort of thing

Also notice that the Dog class doesn't have a name attribute. The __init__ block assigns one to each instance, so it's present in each of their local dictionaries, but the class itself has no knowledge of it. It's a subtle difference, because every instance of Dog has a value for this (unless you deleted it for some reason), but it's not a variable that's present in the class. You can think of it like a lack of a default, if you like
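To round out the shadowing point: deleting the instance's local entry makes lookups fall through to the class again. A quick sketch, re-declaring the same Dog class:

```python
class Dog:
    is_good = True

    def __init__(self, name):
        self.name = name

rude_dog = Dog('Rude Dog')
rude_dog.is_good = False       # creates a shadowing entry in the instance dict
del rude_dog.is_good           # deletes only that local entry...
recovered = rude_dog.is_good   # ...so lookup falls through to Dog again
```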

:words:

baka kaba fucked around with this message at 18:35 on Dec 12, 2018

cinci zoo sniper
Mar 15, 2013




CarForumPoster posted:

TLDR: I get ValueError: ('cannot index with vector containing NA / NaN values', 'occurred at index 0') but I don't think I have any NA values.

:words:

I don't understand why your find_law function just addresses globals directly. Or how exactly it's meant to work, to be clear: I'm failing to even replicate a ValueError there. Here's a cleaner example that works and hopefully does what I assume you intended to achieve.

Python code:
from fuzzywuzzy import process
import pandas as pd

test = pd.DataFrame()
test["category"] = ["fooooooooo", "baaaaaaaar", "baaaaaaaaz"]
test["Lev"] = ["foo", "bar", "baz"]


def find_law(row, ref):
    abbrev = process.extractOne(row, choices=ref["category"], score_cutoff=80)
    return None if not abbrev else ref.loc[ref.category == abbrev[0], "Lev"].item()


warcrimes = pd.DataFrame()
warcrimes["grammar"] = ["foooooooo0", "baaaaaaa4r", "6aaaaaaaaz"]
warcrimes["article"] = warcrimes.grammar.apply(lambda x: find_law(x, test))

SurgicalOntologist
Jun 17, 2004

Instead of df_raw["StatCat"][row], try row["StatCat"] inside the function you're applying. The point of apply is that the looping is done internally so you shouldn't have to index back into the dataframe. The row that is passed to apply is not the row index, it's the actual row data as a series.
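A stripped-down sketch of that pattern, with toy data standing in for the real dataframe:

```python
import pandas as pd

# toy columns, not the real data
df = pd.DataFrame({"StatCat": ["456.023&1", "316.193&2"], "Other": [1, 2]})

def find_law(row):
    # row is a Series holding one row's data - index it by column name,
    # no need to reach back into the dataframe
    return row["StatCat"].split("&")[0]

result = df.apply(find_law, axis=1)
```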

cinci zoo sniper
Mar 15, 2013




SurgicalOntologist posted:

Instead of df_raw["StatCat"][row], try row["StatCat"] inside the function you're applying. The point of apply is that the looping is done internally so you shouldn't have to index back into the dataframe. The row that is passed to apply is not the row index, it's the actual row data as a series.

Yeah I was wondering if there is some indexing magic going on that can actually work like in that example or not.

keyframe
Sep 15, 2007

Hi guys, had another question:

I was practicing web scraping some more and decided to practice iterating through the pages of a web comic and saving the image files locally to my HD. I have the code below that gets the link to the Penny Arcade comic image:

Python code:

url = 'https://www.penny-arcade.com/comic/2018/12/07'

r = requests.get(url)
r.raise_for_status()

soup = BeautifulSoup(r.text, 'html.parser')

elist = []
for i in soup.find_all('img'):
    elist.append(i['src'])

print(elist[0])

# element from inspect is <img src="https://photos.smugmug.com/
#     Comics/Pa-comics/n-xmQS5/i-XZhxqqp/0/2100x20000/i-XZhxqqp-2100x20000.jpg"
#     alt="The Orb In All Of Us" width="1050">

I saw that i['src'] trick in a post on Stack Overflow but I have no idea why it works. What exactly are we asking for/getting with ['src'] there? Up till now I've only seen brackets used to get list indexes so I am not sure what this one is doing. Is it a BeautifulSoup thing? Running it gives me the 'https://photos.smugmug.com/Comics/Pa-comics/n-xmQS5/i-r3gZvBb/0/2100x20000/i-r3gZvBb-2100x20000.jpg' link without all the "img alt="Juice The Youth" src=" stuff in front of the image link, which is what I want, but I would love to know why it works. :)

Thanks so much for any help!

baka kaba
Jul 19, 2003


thing[key] is a conventional Python way of accessing an item - the places you're most likely to see it are in things like dictionaries (passing in a key to get a value) and lists (passing in an index). It's done by implementing a couple of special functions in a class, __getitem__ and __setitem__, which pass in the key and do whatever to return the appropriate result

So BeautifulSoup is giving you an object (a <class 'bs4.element.Tag'> in this case) that implements those methods, so you can do easy key/value lookups as though it's a dictionary, so it's a very pythony way of working with stuff. It's allowing you to access the HTML attributes on the tag by name, by taking the name as a key. That's how it works under the hood anyway - it's extremely worth looking at the vast documentation to see what other methods you can use to wrangle stuff, it's its own thing to learn and you can do a lot!
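As a toy illustration of the mechanism (not bs4's actual implementation), a class only needs those two methods to support bracket access:

```python
class Tag:
    # toy stand-in for how bs4's Tag supports tag['src'] -
    # same mechanism, nothing like the real implementation
    def __init__(self, attrs):
        self.attrs = attrs

    def __getitem__(self, key):
        # called for tag[key] reads
        return self.attrs[key]

    def __setitem__(self, key, value):
        # called for tag[key] = value writes
        self.attrs[key] = value

img = Tag({'src': 'pic.jpg', 'alt': 'a picture'})
img['src']   # calls img.__getitem__('src')
```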

keyframe
Sep 15, 2007


baka kaba posted:

thing[key] is a conventional Python way of accessing an item - the places you're most likely to see it are in things like dictionaries (passing in a key to get a value) and lists (passing in an index). It's done by implementing a couple of special functions in a class, __getitem__ and __setitem__, which pass in the key and do whatever to return the appropriate result

:words:

Thank you for the explanation. I remembered to do a type(i) in the loop after posting the question, which gave me bs4.element.Tag. Reading up on the BS docs on tags now and it cleared things up a bunch.

This stuff is fun and frustrating at the same time. :unsmith:

baka kaba
Jul 19, 2003


Yeah BS is kinda overwhelming at first, and there's more than one way of doing things too

Personally I like using CSS selectors, which is its own syntax to learn but it can be pretty neat. But yeah you still have to pull out the data you need once you have the elements you wanted!

btw, you're probably ending up with a lot of images just doing find_all - if you inspect the page, the comic pic is inside a div with an id called comicFrame, so if you can grab that element and then do a find call for img tags on that (instead of the whole document) you'll target exactly what you need. (Or using CSS selectors, soup.select_one('#comicFrame img'))
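A quick sketch of that targeting, with a cut-down stand-in for the page's HTML:

```python
from bs4 import BeautifulSoup

# cut-down stand-in for the page structure described above
html = '''
<div id="comicFrame"><img src="comic.jpg"></div>
<img src="banner.jpg">
'''
soup = BeautifulSoup(html, 'html.parser')

# CSS selector: the img inside the element with id comicFrame
comic = soup.select_one('#comicFrame img')
src = comic['src']   # just the comic, not the banner
```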

Gothmog1065
May 14, 2009

baka kaba posted:

I dunno if you have any experience with other languages, but defining a variable on a class is basically setting an attribute that's inherent to all instances of that class. So if you change the value of that attribute, all instances are affected because it's a shared 'trait' if you like

:words:

No, no real experience with other languages (hence my obvious dumb questions). This has been incredibly helpful in my coding experiences.

baka kaba
Jul 19, 2003


I was just gonna draw some parallels to Java if that helped!

*record scratch*

Feral Integral
Jun 6, 2006

YOSPOS

keyframe posted:

Hi guys, had another question:

I was practicing web scraping some more and decided to practice iterating through the pages of a web comic and saving the image files locally to my hd. I have the code below that gets the link to the penny arcade comic link:


I have a lot of page scraping experience from my previous job. I trained a few people in my time there, and the absolute fastest way to get someone comfortable with selectors/xpaths etc. is starting a selenium browser window in the interpreter. Now use the browser's inspection tool to figure out the data you want to access (right click on something you're interested in on the page -> inspect)*. Next, go back to the interpreter console and try out your xpaths and adjust them until you get what you want. Once you have the xpath figured out you can just plug that into BS or do whatever you want with it.

Hope I explained that well enough, let me know if you need more detail

*edit one thing I should note: if you are using selenium browser like this, make sure you navigate to the url you are interested in using the interpreter command line, not the selenium browser URL bar

As a suggestion for somebody new, learning xpaths this way is just more visual, I guess. Also, using the interpreter allows you to pretty much code as you go, so you wouldn't have to find your xpath, copy it into your code, run it and see the result. You can just do it right in the interpreter, move on to the next thing, and copy the appropriate lines from your interpreter history when you're finished.

Feral Integral fucked around with this message at 20:03 on Dec 13, 2018

baka kaba
Jul 19, 2003


How come you use Selenium for that? I usually just inspect in a normal browser like Chrome, you can select an element to see the hierarchy at the bottom, and do ctrl+F to type a selector or XPath and see if it highlights the right thing

CarForumPoster
Jun 26, 2013


SurgicalOntologist posted:

Instead of df_raw["StatCat"][row], try row["StatCat"] inside the function you're applying. The point of apply is that the looping is done internally so you shouldn't have to index back into the dataframe. The row that is passed to apply is not the row index, it's the actual row data as a series.

cinci zoo sniper posted:

Yeah I was wondering if there is some indexing magic going on that can actually work like in that example or not.

This was the issue. Thank you both.

The new issue is that the fuzzy logic function is hilariously slow. I need to run it on 1M rows. It takes 2m 30s to run on 1000 rows.

I'm going to try to figure out something with .map and str.contains or some other solution that ends up being "good enough"

SurgicalOntologist
Jun 17, 2004

Do you have python-levenshtein installed? IIRC it's not a requirement of fuzzywuzzy but if it's available it will run faster. Other than that I would just bite the bullet and run it over the weekend, assuming you only need to run it once.

CarForumPoster
Jun 26, 2013


SurgicalOntologist posted:

Do you have python-levenshtein installed? IIRC it's not a requirement of fuzzywuzzy but if it's available it will run faster. Other than that I would just bite the bullet and run it over the weekend, assuming you only need to run it once.

Yes, still slow as balls and I'm doing EDA with a large, new data set so I tend to run it frequently.

I made some improvements to the lookup table and now I'm down to 20% NaNs instead of 40% NaNs, and it runs lightning fast, but it still misses anything that's not an exact match.

code:
df_raw["StatLev"] = df_raw["StatCat"].map(ArrestTable.set_index('StatuteCat')["Lev"])
e.g. the string I have is 456.023&1
the lookup table has
456.023&1a
456.023&1b


But I am fine with returning the first one that it matches, as it's accurate enough, so fuzzywuzzy was returning the result after matching on 456.023&1a

CarForumPoster fucked around with this message at 22:54 on Dec 13, 2018

cinci zoo sniper
Mar 15, 2013




CarForumPoster posted:

Yes, still slow as balls and I'm doing EDA with a large, new data set so I tend to run it frequently.

I made some improvements to the lookup table and now am up to 20% NaNs instead of 40% NaNs and it runs lightning fast, but still misses anything that's not an exact match.

:words:

If your target catalog entries are all written like XXX.YYY&something then I would drop the fuzzy match altogether and just create a crude index manually, by doing a "string starts with" filter up to the special symbol. That should be comically faster, and I assume some library has that ironed out somewhere.
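A rough sketch of that idea in pandas (toy table, made-up values):

```python
import pandas as pd

# toy stand-in for the real lookup table
ArrestTable = pd.DataFrame({
    "StatuteCat": ["456.023&1a", "456.023&1b", "316.193&2a"],
    "Lev": ["M", "F", "M"],
})

def first_startswith(statcat, table):
    # take the first catalog row whose key starts with the query -
    # much cheaper than a fuzzy match when keys share the XXX.YYY& prefix
    hits = table[table["StatuteCat"].str.startswith(statcat)]
    return hits["Lev"].iloc[0] if len(hits) else None

first_startswith("456.023&1", ArrestTable)
```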

keyframe
Sep 15, 2007

Guys I am so loving stoked, wrote my first "I did it on my own" code in Python since I started learning it a month ago and it works. Thank you all who helped, you guys are awesome. It scrapes a year worth of comics from penny arcade and saves it to a folder. Not necessarily a fan of PA but they had that 2018/xx/xx format at the end of the url which presented a fun challenge to solve. I will post my code for you guys to laugh at. :unsmith:


Python code:

import requests
from bs4 import BeautifulSoup
import os
import urllib.request

# // Penny Arcade Comic Scraper

os.chdir('I:\\Scrape')

url = 'https://www.penny-arcade.com/comic/'

def pa_urlDate():
    # // generate the 2017/xx/xx date strings to add to the end of the url
    ret_lst = []
    for month in range(1, 13):
        for day in range(1, 32):  # day 0 isn't a real date, start at 1
            ret_lst.append(url + '2017/{:02d}/{:02d}'.format(month, day))
    return ret_lst

url_list = pa_urlDate()  # // iterate this for the url end dates

for i in range(len(url_list)):
    r = requests.get(url_list[i])
    pa_soup = BeautifulSoup(r.text, 'html.parser')
    check404 = pa_soup.find_all('h2')

    if check404[0].text == '404':
        continue

    imgLoc = pa_soup.find_all('img')[0]['src']
    urllib.request.urlretrieve(imgLoc, 'pa{}.jpg'.format(i))

KICK BAMA KICK
Mar 2, 2009

Never done any multiprocessing before. Have a problem that amounts to polling for some input, then upon getting some, comparing it to each item in a pre-existing database with an expensive calculation. After my basic research into the multiprocessing module I cobbled together a solution in the form of splitting the database into n chunks and spawning a process to evaluate the input against each of those n chunks and then compiling the n results. This appears to work as expected -- using 3 cores is a little less than 3 times as fast as doing it in one process, and I get the expected result. Is this the correct pattern? Like I'm happy with what's happening but I genuinely did not understand what I read about multiprocessing.Pool or the other stuff in that module so I'm not sure if there was a better way to do what I'm doing.
code:
chunk_size = total // n_processes  # integer division

def _work_chunk(input, pipe, i):
    db = get a new database connection, read-only
    for row in db.execute(get the ith chunk_size rows):
        do expensive calculations on the input vs this row
    pipe.send(results)

def work_all(input):
    pipes, jobs = [], []
    for i in range(n_processes):
        out_pipe, in_pipe = mp.Pipe(False)
        job = mp.Process(target=_work_chunk, args=(input, in_pipe, i))
        pipes.append(out_pipe)
        jobs.append(job)
        job.start()

    for job in jobs:
        job.join()

    all_results = whatever
    for pipe in pipes:
        this chunk's result = pipe.recv()
        add that to all_results
    return all_results
e: Like I can't tell: is this cleaner with concurrent.futures.ProcessPoolExecutor, and is a sqlite3 query even iterable in that way?

KICK BAMA KICK fucked around with this message at 08:14 on Dec 14, 2018

Furism
Feb 21, 2006

Live long and headbang
I'm a decent C#/TS hobby developer and I need to get into Python for working on some automation stuff at work. Are the courses in the OP still the recommended ones for learning? I don't need to be taught what a function or variable, only any Python-specific concepts (if any) and how to get poo poo done quickly. Any recommendation?

QuarkJets
Sep 8, 2008

KICK BAMA KICK posted:

Never done any multiprocessing before. Have a problem that amounts to polling for some input, then upon getting some, comparing it to each item in a pre-existing database with an expensive calculation. After my basic research into the multiprocessing module I cobbled together a solution in the form of splitting the database into n chunks and spawning a process to evaluate the input against each of those n chunks and then compiling the n results. This appears to work as expected -- using 3 cores is a little less than 3 times as fast as doing it in one process, and I get the expected result. Is this the correct pattern? Like I'm happy with what's happening but I genuinely did not understand what I read about multiprocessing.Pool or the other stuff in that module so I'm not sure if there was a better way to do what I'm doing.
code:
chunk_size = total // n_processes  # integer division

def _work_chunk(input, pipe, i):
    db = get a new database connection, read-only
    for row in db.execute(get the ith chunk_size rows):
        do expensive calculations on the input vs this row
    pipe.send(results)

def work_all(input):
    pipes, jobs = [], []
    for i in range(n_processes):
        out_pipe, in_pipe = mp.Pipe(False)
        job = mp.Process(target=_work_chunk, args=(input, in_pipe, i))
        pipes.append(out_pipe)
        jobs.append(job)
        job.start()

    for job in jobs:
        job.join()

    all_results = whatever
    for pipe in pipes:
        this chunk's result = pipe.recv()
        add that to all_results
    return all_results
e: Like I can't tell is this cleaner with concurrent.futures.ProcessPoolExecutor or is a sqlite3 query even iterable in that way?

I think you're making things harder than they need to be; I'd recommend looking into multiprocessing.Pool.map. The database query probably doesn't benefit from parallelization, so you could have your main process perform the query and then submit N jobs to a pool of workers that you create with multiprocessing.Pool. Using map greatly simplifies your code: you don't have to set up your own pipes or anything, the results just get dumped to a list.

Look here:
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers

Your procedure would be:
1) Define a function that performs expensive operations on one row of database values
2) Create a pool of N workers with multiprocessing.Pool
3) Query database for all rows that you intend to process
4) Invoke Pool.map, passing it the function handle and all of the data that you want processed. Behind the scenes a queue will be created and the workers will keep working through the queue until it's empty. Pool.map is a blocking call, so your main process will pause until execution completes (if you want asynchronous calls then that can be done with other, very similar multiprocessing functions). It'll return a list of values, where each entry of the list maps back to your input (e.g. length of the output list == length of the function's inputted iterable == number of rows in your query). You can also add one more worker to your pool (I assume you have 4 cores, not 3?) because the main process is blocked anyway

More broadly, think about what happens in the edge cases of your design: if the first 33% of your rows processed instantaneously (for instance) then 1 of the workers in your pool of 3 is going to finish and sit around doing nothing, right? You probably don't want that. Realistically maybe that will never happen, but it helps to illustrate that the design can result in inefficient use of your resources. Having the workers each process 1 row at a time from queue is likely more efficient, and it's such a common concept that there are a whole bunch of functions in multiprocessing that are designed to operate this way, e.g. you have some large number of tasks (N rows of data to process in an expensive way) and want to assign them to a small pool of workers (M << N) to work through as fast as they can. These functions take care of the boilerplate behind spawning a bunch of processes, giving them a queue of data to work through, waiting for them to finish, and gathering the results into a list; you just need to pass the data and a function to a function, and you get a list back.

QuarkJets fucked around with this message at 12:44 on Dec 14, 2018

QuarkJets
Sep 8, 2008

Furism posted:

I'm a decent C#/TS hobby developer and I need to get into Python for working on some automation stuff at work. Are the courses in the OP still the recommended ones for learning? I don't need to be taught what a function or variable, only any Python-specific concepts (if any) and how to get poo poo done quickly. Any recommendation?

I've heard that Automate the Boring Stuff is a good book for beginners and may be right up your alley

Proteus Jones
Feb 28, 2013



QuarkJets posted:

I've heard that Automate the Boring Stuff is a good book for beginners and may be right up your alley

Is that book still stuck on 2.7 or has it moved to the modern era?

QuarkJets
Sep 8, 2008

Proteus Jones posted:

Is that book still stuck on 2.7 or has it moved to the modern era?

No idea! IMO it doesn't matter to a newbie anyway, because the fundamentals are the same

Proteus Jones
Feb 28, 2013



QuarkJets posted:

No idea! IMO it doesn't matter to a newbie anyway, because the fundamentals are the same

True enough

Furism
Feb 21, 2006

Live long and headbang
Cool, will try this thanks! I'll start by automating the creation of RSA/DSA certs and move from there.

9-Volt Assault
Jan 27, 2007

Beter twee tetten in de hand dan tien op de vlucht.

Proteus Jones posted:

Is that book still stuck on 2.7 or has it moved to the modern era?

It has always been Python 3. You are probably thinking of Learn Python The Hard Way, from Zed "Python 3 is not Turing complete" Shaw?

And saying that Python 2 or 3 doesn't matter in tyool 2018 is just wrong.

SurgicalOntologist
Jun 17, 2004


In addition to what QuarkJets said, yes it would be cleaner with concurrent.futures. It's a higher level interface.

Dominoes
Sep 20, 2007

Let me know if you'd like another guide updated in OP. I'm not adding LPTHW.

QuarkJets
Sep 8, 2008

9-Volt Assault posted:

And saying that python 2 or 3 doesnt matter in tyool 2018 is just wrong.

To someone who's just trying to get a handle on how to use Python at all it doesn't really matter, the difference to them is going to be whether or not print is a function who cares


KICK BAMA KICK
Mar 2, 2009

Big thanks, this does give you much cleaner code, but I ran into some issues my naive implementation was (unknowingly) sidestepping. Pool.map calls list(iterable), which loads my too-large database into memory at once; Pool.imap doesn't do that, but is there a good way to wrap my database access so the sqlite3 connection doesn't complain about being accessed from multiple threads? (I'm using a read-only connection to the db, and I see the check_same_thread flag for connections, but it seems like that's undocumented for a reason?)

Other thing is, I failed to mention the basic outline of my problem: take input, run expensive_calculation(input, row) for each row in the database, and find the row that returns the highest value. results = Pool.map(rows); max(results) is fine but seems wasteful, and I'm also hoping to short-circuit the process once I find a score above a certain threshold, so I might actually have to do this manually?


QuarkJets posted:

You can also add one more worker to your pool (I assume you have 4 cores, not 3?) because the main process is blocked anyway
Funny enough, my 10-year-old desktop actually is a tri-core Athlon II, but I was just going by a comment I saw on StackOverflow recommending cpu_count() - 1, so on the four-core machine where the code actually runs I was only using three subprocesses.

e: Oh, query the database in the main thread and submit each row to a concurrent.futures.ProcessPoolExecutor? Then iterate through concurrent.futures.as_completed(those_futures) to get the max. That more on the right track?

ee: Yep, :perfect:, huge thanks

KICK BAMA KICK fucked around with this message at 05:20 on Dec 15, 2018
