|
Selenium has a subset of BS's functionality, but it also literally runs in a live browser, so it can work with pages that require JavaScript to run, or need to be interacted with to load the stuff you want, or otherwise have some awkward workflow. It's also able to do actual automation for testing and the like, as well as basic scraping. BS is more powerful and expressive (I think, been a while since I used Selenium), but it does rely on you getting the HTML you need to scrape. It has some basic networking stuff, or you can use something like Requests for more power, but some sites just don't play nice, and it can be a lot less hassle to have Selenium run an actual full browser to get the page loaded
|
# ? Dec 11, 2018 00:56 |
|
|
# ? May 14, 2024 07:05 |
|
An often overlooked alternative is to investigate the site's requests using dev tools (particularly the network tab). Sometimes there's a URL you can hit for JSON data and not have to parse any HTML. Especially if you find yourself needing JavaScript to get the data -- it has to come from somewhere.
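For instance, once you've spotted an endpoint in the network tab, pulling the JSON directly is only a few lines. This is a stdlib-only sketch - the URL and field names are made up, the real ones are whatever the site actually uses:

```python
import json
from urllib.request import Request, urlopen

def fetch_json(url):
    """GET a URL and parse the response body as JSON (stdlib only)."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def pick(items, field):
    """Pull one field out of a list of JSON objects."""
    return [item[field] for item in items]

# e.g. titles = pick(fetch_json("https://example.com/api/items?page=1"), "title")
# (that URL is hypothetical - use whatever you spot in the network tab)
sample = [{"title": "strip 1"}, {"title": "strip 2"}]  # shaped like a typical payload
print(pick(sample, "title"))  # ['strip 1', 'strip 2']
```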
|
# ? Dec 11, 2018 03:44 |
|
bob dobbs is dead posted:dont learn oop
|
# ? Dec 11, 2018 07:44 |
|
Python classes are cool cause everything in Python is cool. Oh, you want to add another attribute to this object? Sure, whatever you want bub, there you go. What, you want to use setters/getters but want it to look like you're just accessing the attribute directly? Okay, whatever you say boss
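Both behaviors in one quick sketch - the class and attribute names here are invented for the example; the getter/setter-that-looks-like-plain-access trick is the @property decorator:

```python
class Bitmap:
    def __init__(self, width):
        self._width = width  # "private" by convention only

    @property
    def width(self):
        # reads like plain attribute access: bitmap.width
        return self._width

    @width.setter
    def width(self, value):
        # runs on assignment: bitmap.width = ...
        if value < 0:
            raise ValueError("width can't be negative")
        self._width = value

bitmap = Bitmap(64)
bitmap.butts = 101   # dynamic attribute assignment, no class changes needed
bitmap.width = 128   # goes through the setter, but looks like direct access
```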
|
# ? Dec 11, 2018 07:46 |
|
baka kaba posted:Selenium has a subset of BS's functionality, but it also literally runs in a live browser, so it can work with pages that require javascript to run, or need to be interacted with to load the stuff you want, or otherwise have some awkward workflow. It's also able to do actual automation for testing and the like, as well as basic scraping Cheers guys. Sounds like it's best to learn both.
|
# ? Dec 11, 2018 18:29 |
|
baka kaba posted:If it's not clear, Python's a dynamic language where you can just assign attributes and functions to objects whenever you like. So you can take any thing and go bitmap.butts = 101 or whatever you like - under the hood there's a local namespace with a dictionary of attributes and functions, and you can add and remove from that however you want So just to get my puny little mind over the semantics: code:
I'm manually converting and rewriting a function-based script to a class-based script, so the main reason for the questions is that while I'm rewriting, I needed to make sure I knew that within a class, name != self.name.
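A tiny example of exactly that distinction - a plain name is just a local variable, self.name lives on the instance (class and names invented for illustration):

```python
class Greeter:
    def __init__(self, name):
        # 'name' here is just a local variable (the parameter);
        # 'self.name' is an attribute stored on the instance
        self.name = name

    def greet(self):
        name = "local shadow"      # local to this method call only
        return (name, self.name)   # the two are completely independent

g = Greeter("Gothmog")
print(g.greet())  # ('local shadow', 'Gothmog')
```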
|
# ? Dec 11, 2018 21:02 |
|
I think it's more accurate to say that the first is printing test.name, where test is an instance rather than a class. And if you didn't define self.name, then test.name would refer to the class variable
|
# ? Dec 12, 2018 00:50 |
|
You can look at the dictionaries themselves and see where the entries are:code:
code:
*this is a little bit of a lie. There are some other names that are somewhere else. That's why you don't see an entry for "__class__" in test and test2's dicts even though you can access "test.__class__" without getting an error. And there are a few more places that get searched for base classes.
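A runnable sketch of poking at those dicts - the test/test2 naming follows the post, but since the original code blocks weren't preserved this is just an illustration of the idea:

```python
class Test:
    name = "class-level"   # lives in Test.__dict__

test = Test()
test2 = Test()
test.name = "instance-level"   # lives only in test's own dict

# test has its own entry; test2's dict is empty, so reads fall back to the class
print(test.__dict__)    # {'name': 'instance-level'}
print(test2.__dict__)   # {}
print(test2.name)       # 'class-level', found via Test.__dict__
```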
|
# ? Dec 12, 2018 05:43 |
|
QuarkJets posted:I think it's more accurate to say that the first is printing test.name, where test is an instance rather than a class. So test.name will look for the name on the 'test' instance first, and if it doesn't exist, then fall back to the class variable? Also, thanks for answering all these nitpicky questions, this really helps me grasp class variable inheritance much more clearly. This isn't directly related to my code, but may be of help in the future, and greatly helps my understanding of classes in general.
|
# ? Dec 12, 2018 12:59 |
|
TLDR: I get ValueError: ('cannot index with vector containing NA / NaN values', 'occurred at index 0') but I don't think I have any NA values. I have code that wants to look up a string from df["StatCat"] in ArrestTable and return whether it's a misdemeanor, felony, or other in the column "Lev". It will eventually do this for several other columns as well. Because the records aren't always exact in writing the statute number that's in StatCat, it needs to be a fuzzy logic match. code:
code:
code:
code:
code:
CarForumPoster fucked around with this message at 15:34 on Dec 12, 2018 |
# ? Dec 12, 2018 14:42 |
|
Gothmog1065 posted:So test.name will look for the 'test' instance of the name first, and if it doesn't exist, then fall back to the class variable? I dunno if you have any experience with other languages, but defining a variable on a class is basically setting an attribute that's inherent to all instances of that class. So if you change the value of that attribute, all instances are affected, because it's a shared 'trait' if you like Python code:
But in __init__ we're setting a name variable, on the instance itself. Like how you can define local variables inside a function, that are only visible in that scope, you're setting a value on the instance's local dictionary of variables. So you can create two dogs with different names, and that name assignment will happen locally for each of them - and you can change the names, or even delete them, and it'll only happen to the instance you're doing that to. The two instances are both Dogs, but you're not changing a shared variable in the Dog class, you're messing with their local, independent attributes in each instance

The whole "local scope in a function" thing is really how it all works, and that's what Foxfire_ is getting at. The class variables are like a higher, 'global' scope, but the instance objects have their own local scope too. If you try to read dog_instance.whatever it'll first check the local scope in dog_instance to see if that variable is defined, and if not it'll go up the chain to the Dog class to see if that has this particular variable name. (And if it doesn't, and the class inherits from another class, it can go up the chain to see if it's defined anywhere in the hierarchy)

When you assign a value, you're writing it to the local dictionary of whatever you're doing the assignment on. If you do it to the instance, you're creating a local value, so when you try and read it from the instance you'll immediately find that local value and get that back - it won't need to go looking in the class or any of its parents. This means you can shadow a variable in a higher scope - basically overriding it with another variable with the exact same name that will get read instead. You're not changing that attribute in the higher scope, you're creating a new local one that will take precedence (and in some situations your IDE will warn you about this) Python code:
If you want to change the value in Dog, you gotta do it explicitly in the same way. Python code:
Also notice that the Dog class doesn't have a name attribute. The __init__ block assigns one to each instance, so it's present in each of their local dictionaries, but the class itself has no knowledge of it. It's a subtle difference, because every instance of Dog has a value for this (unless you deleted it for some reason), but it's not a variable that's present in the class. You can think of it like a lack of a default, if you like. baka kaba fucked around with this message at 18:35 on Dec 12, 2018 |
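A runnable sketch of the shadowing behavior described above - the Dog naming follows the post, but the post's actual code blocks weren't preserved, so this is an illustration rather than the original:

```python
class Dog:
    species = "canine"            # shared class variable

    def __init__(self, name):
        self.name = name          # instance-local attribute, not on the class

rex, fido = Dog("Rex"), Dog("Fido")

rex.species = "very good boy"     # shadows the class variable on rex only
assert fido.species == "canine"   # fido still reads the shared class variable

Dog.species = "domestic canine"   # changing the class affects non-shadowed reads
assert fido.species == "domestic canine"
assert rex.species == "very good boy"   # rex's local value still takes precedence
assert "name" not in Dog.__dict__       # the class has no default name at all
```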
# ? Dec 12, 2018 18:28 |
CarForumPoster posted:TLDR: I get ValueError: ('cannot index with vector containing NA / NaN values', 'occurred at index 0') but I don't think I have any NA values. I don't understand why your find_law function just addresses globals directly, or how exactly it is meant to work; to be clear, I'm failing to even replicate a ValueError there. Here's a cleaner example that works and hopefully does what I assume you intended to achieve. Python code:
|
|
# ? Dec 12, 2018 19:02 |
|
Instead of df_raw["StatCat"][row], try row["StatCat"] inside the function you're applying. The point of apply is that the looping is done internally so you shouldn't have to index back into the dataframe. The row that is passed to apply is not the row index, it's the actual row data as a series.
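A minimal sketch of that shape - the column names follow the thread, but this find_law is a made-up stand-in for the real fuzzy lookup, just to show the row-as-Series indexing:

```python
import pandas as pd

df = pd.DataFrame({"StatCat": ["316.193", "810.02"],
                   "Desc": ["DUI", "Burglary"]})

def find_law(row):
    # 'row' is a Series of one row's data, so index it by column name
    # directly - no reaching back into the dataframe. (Classification
    # logic here is a hypothetical stand-in.)
    return "F" if row["StatCat"].startswith("810") else "M"

df["Lev"] = df.apply(find_law, axis=1)   # axis=1 -> pass one row at a time
print(df["Lev"].tolist())  # ['M', 'F']
```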
|
# ? Dec 12, 2018 19:15 |
SurgicalOntologist posted:Instead of df_raw["StatCat"][row], try row["StatCat"] inside the function you're applying. The point of apply is that the looping is done internally so you shouldn't have to index back into the dataframe. The row that is passed to apply is not the row index, it's the actual row data as a series. Yeah I was wondering if there is some indexing magic going on that can actually work like in that example or not.
|
|
# ? Dec 12, 2018 19:19 |
|
Hi guys, had another question: I was practicing web scraping some more and decided to practice iterating through the pages of a web comic and saving the image files locally to my HD. I have the code below that gets the link to the Penny Arcade comic: Python code:
Thanks so much for any help!
|
# ? Dec 12, 2018 19:39 |
|
thing[key] is a conventional Python way of accessing an item - the places you're most likely to see it are in things like dictionaries (passing in a key to get a value) and lists (passing in an index). It's done by implementing a couple of special methods in a class, __getitem__ and __setitem__, which take the key and do whatever's needed to return the appropriate result. So BeautifulSoup is giving you an object (a <class 'bs4.element.Tag'> in this case) that implements those methods, so you can do easy key/value lookups as though it's a dictionary - it's a very pythony way of working with stuff. It's allowing you to access the HTML attributes on the tag by name, by taking the name as a key. That's how it works under the hood anyway - it's extremely worth looking at the vast documentation to see what other methods you can use to wrangle stuff; it's its own thing to learn and you can do a lot!
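A toy version of the idea - this is not bs4's actual implementation, just an illustration of how __getitem__/__setitem__ make an object usable with square brackets:

```python
class Tag:
    """Toy stand-in for bs4's Tag: an attrs dict exposed via item access."""
    def __init__(self, attrs):
        self.attrs = attrs

    def __getitem__(self, key):
        # called for tag[key]
        return self.attrs[key]

    def __setitem__(self, key, value):
        # called for tag[key] = value
        self.attrs[key] = value

img = Tag({"src": "comic.png", "alt": "strip"})
print(img["src"])   # 'comic.png' -- dict-style access via __getitem__
img["alt"] = "Penny Arcade strip"   # goes through __setitem__
```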
|
# ? Dec 12, 2018 20:06 |
|
baka kaba posted:thing[key] is a conventional Python way of accessing an item - the places you're most likely to see it are in things like dictionaries (passing in a key to get a value) and lists (passing in an index). It's done by implementing a couple of special functions in a class, __getitem__ and __setitem__, which pass in the key and do whatever to return the appropriate result Thank you for the explanation. I remembered to do a type(i) in the loop after posting the question which gave me the bs4.element.tag. Reading up the BS docs on tags now and it cleared it up a bunch. This stuff is fun and frustrating at the same time.
|
# ? Dec 12, 2018 20:15 |
|
Yeah, BS is kinda overwhelming at first, and there's more than one way of doing things too. Personally I like using CSS selectors, which is its own syntax to learn, but it can be pretty neat. But yeah, you still have to pull out the data you need once you have the elements you wanted! Btw, you're probably ending up with a lot of images just doing find_all - if you inspect the page, the comic pic is inside a div with an id called comicFrame, so if you can grab that element and then do a find call for img tags on that (instead of the whole document) you'll target exactly what you need. (Or using CSS selectors, soup.select_one('#comicFrame img'))
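Something like this, assuming bs4 is installed and the page is structured as described - the HTML here is a made-up stand-in mirroring that comicFrame div, the real page will differ:

```python
from bs4 import BeautifulSoup

# Minimal HTML shaped like the structure described above: the comic <img>
# lives inside a div with id="comicFrame", plus a decoy image elsewhere.
html = """
<div id="comicFrame"><img src="/comics/2018-12-12.jpg"></div>
<img src="/banner.jpg">
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: the img inside the element with id comicFrame
comic = soup.select_one("#comicFrame img")
print(comic["src"])  # /comics/2018-12-12.jpg -- skips the banner entirely
```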
|
# ? Dec 12, 2018 20:36 |
|
baka kaba posted:I dunno if you have any experience with other languages, but defining a variable on a class is basically setting an attribute that's inherent to all instances of that class. So if you change the value of that attribute, all instances are affected because it's a shared 'trait' if you like No, no real experience with other languages (hence my obvious dumb questions). This has been incredibly helpful in my coding experiences.
|
# ? Dec 12, 2018 20:51 |
|
I was just gonna draw some parallels to Java if that helped! *record scratch*
|
# ? Dec 13, 2018 17:50 |
|
keyframe posted:Hi guys, had another question: I have a lot of page scraping experience from my previous job. I trained a few people in my time there, and the absolute fastest way to get someone comfortable with selectors/xpaths etc. is starting a selenium browser window in the interpreter. Now use the browser's inspection tool to figure out the data you want to access (right click on something you're interested in on the page -> inspect)*. Next, go back to the interpreter console and try out your xpaths and adjust them until you get what you want. Once you have the xpath figured out you can just plug that into BS or do whatever you want with it. Hope I explained that well enough, let me know if you need more detail.

*edit: one thing I should note: if you are using a selenium browser like this, make sure you navigate to the URL you are interested in using the interpreter command line, not the selenium browser URL bar.

As a suggestion for somebody new learning xpaths, it's just more visual I guess. Also, using the interpreter allows you to pretty much code as you go, so you wouldn't have to find your xpath, copy it into your code, run it and see the result. You could just boom, do it right in the interpreter, move on to the next thing, and copy the appropriate lines from your interpreter history when you're finished. VV Feral Integral fucked around with this message at 20:03 on Dec 13, 2018 |
# ? Dec 13, 2018 19:18 |
|
How come you use Selenium for that? I usually just inspect in a normal browser like Chrome, you can select an element to see the hierarchy at the bottom, and do ctrl+F to type a selector or XPath and see if it highlights the right thing
|
# ? Dec 13, 2018 19:43 |
|
SurgicalOntologist posted:Instead of df_raw["StatCat"][row], try row["StatCat"] inside the function you're applying. The point of apply is that the looping is done internally so you shouldn't have to index back into the dataframe. The row that is passed to apply is not the row index, it's the actual row data as a series. cinci zoo sniper posted:Yeah I was wondering if there is some indexing magic going on that can actually work like in that example or not. This was the issue, thank you both. The new issue is that the fuzzy logic function is hilariously slow. I need to run it on 1M rows, and it takes 2m 30s to run on 1000 rows. I'm going to try to figure out something with .map and str.contains or some other solution that ends up being "good enough"
|
# ? Dec 13, 2018 22:16 |
|
Do you have python-levenshtein installed? IIRC it's not a requirement of fuzzywuzzy but if it's available it will run faster. Other than that I would just bite the bullet and run it over the weekend, assuming you only need to run it once.
|
# ? Dec 13, 2018 22:31 |
|
SurgicalOntologist posted:Do you have python-levenshtein installed? IIRC it's not a requirement of fuzzywuzzy but if it's available it will run faster. Other than that I would just bite the bullet and run it over the weekend, assuming you only need to run it once. Yes, still slow as balls, and I'm doing EDA with a large, new data set so I tend to run it frequently. I made some improvements to the lookup table and am now down to 20% NaNs instead of 40% NaNs and it runs lightning fast, but still misses anything that's not an exact match. code:
the lookup table has 456.023&1a and 456.023&1b. But I am fine with returning the first one that it matches, as it's accurate enough, so fuzzywuzzy was returning the result after matching on 456.023&1a. CarForumPoster fucked around with this message at 22:54 on Dec 13, 2018 |
# ? Dec 13, 2018 22:49 |
CarForumPoster posted:Yes, still slow as balls and I'm doing EDA with a large, new data set so I tend to run it frequently. If your target catalog entries are all written like XXX.YYY&something then I would drop fuzzy matching altogether and just create a crude index manually, by doing a "string starts with" filter on everything up to the special symbol. That should be comically faster, and I assume there's a library implementation of that somewhere already ironed out.
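A crude sketch of that prefix-index idea - the table contents here are invented, just shaped like the 456.023&1a examples above, and the first prefix hit wins like the post describes:

```python
# Hypothetical lookup table: statute-number keys like "456.023&1a"
# mapped to a level code. Contents are made up for illustration.
lookup = {"456.023&1a": "M", "456.023&1b": "M", "810.02&2": "F"}

def find_lev(statute):
    """Return the level for the first table entry starting with `statute`."""
    for key, lev in lookup.items():      # dicts preserve insertion order (3.7+)
        if key.startswith(statute):      # "string starts with" instead of fuzzy
            return lev
    return None                          # no match at all -> NaN territory

print(find_lev("456.023"))  # M -- first hit is 456.023&1a, good enough
```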
|
|
# ? Dec 13, 2018 23:15 |
|
Guys I am so loving stoked, wrote my first "I did it on my own" code in Python since I started learning it a month ago, and it works. Thank you all who helped, you guys are awesome. It scrapes a year's worth of comics from Penny Arcade and saves them to a folder. Not necessarily a fan of PA, but they had that 2018/xx/xx format at the end of the URL which presented a fun challenge to solve. I will post my code for you guys to laugh at. Python code:
|
# ? Dec 13, 2018 23:39 |
|
Never done any multiprocessing before. Have a problem that amounts to polling for some input, then upon getting some, comparing it to each item in a pre-existing database with an expensive calculation. After my basic research into the multiprocessing module I cobbled together a solution in the form of splitting the database into n chunks and spawning a process to evaluate the input against each of those n chunks and then compiling the n results. This appears to work as expected -- using 3 cores is a little less than 3 times as fast as doing it in one process, and I get the expected result. Is this the correct pattern? Like I'm happy with what's happening but I genuinely did not understand what I read about multiprocessing.Pool or the other stuff in that module so I'm not sure if there was a better way to do what I'm doing.code:
KICK BAMA KICK fucked around with this message at 08:14 on Dec 14, 2018 |
# ? Dec 14, 2018 06:53 |
|
I'm a decent C#/TS hobby developer and I need to get into Python for working on some automation stuff at work. Are the courses in the OP still the recommended ones for learning? I don't need to be taught what a function or variable is, only any Python-specific concepts (if any) and how to get poo poo done quickly. Any recommendations?
|
# ? Dec 14, 2018 10:58 |
|
KICK BAMA KICK posted:Never done any multiprocessing before. Have a problem that amounts to polling for some input, then upon getting some, comparing it to each item in a pre-existing database with an expensive calculation. After my basic research into the multiprocessing module I cobbled together a solution in the form of splitting the database into n chunks and spawning a process to evaluate the input against each of those n chunks and then compiling the n results. This appears to work as expected -- using 3 cores is a little less than 3 times as fast as doing it in one process, and I get the expected result. Is this the correct pattern? I think you're making things harder than they need to be; I'd recommend looking into multiprocessing.Pool.map. The database query probably doesn't benefit from parallelization, so you could have your main process perform the query and then submit N jobs to a pool of workers that you create with multiprocessing.Pool. Using map greatly simplifies your code - you don't have to set up your own pipes or anything, the results just get dumped into a list. Look here: https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers

Your procedure would be:
1) Define a function that performs the expensive operations on one row of database values
2) Create a pool of N workers with multiprocessing.Pool
3) Query the database for all rows that you intend to process
4) Invoke Pool.map, passing it the function handle and all of the data that you want processed. Behind the scenes a queue will be created and the workers will keep working through it until it's empty.

Pool.map is a blocking call, so your main process will pause until execution completes (if you want asynchronous calls then that can be done with other, very similar multiprocessing functions). It'll return a list of values, where each entry maps back to your input (e.g. length of the output list == length of the function's inputted iterable == number of rows in your query). You can also add one more worker to your pool (I assume you have 4 cores, not 3?) because the main process is blocked anyway.

More broadly, think about what happens in the edge cases of your design: if the first 33% of your rows processed instantaneously (for instance), then 1 of the workers in your pool of 3 is going to finish and sit around doing nothing, right? You probably don't want that. Realistically maybe that will never happen, but it helps to illustrate that the design can result in inefficient use of your resources. Having the workers each process 1 row at a time from a queue is likely more efficient, and it's such a common concept that there are a whole bunch of functions in multiprocessing that are designed to operate this way, e.g. you have some large number of tasks (N rows of data to process in an expensive way) and want to assign them to a small pool of workers (M << N) to work through as fast as they can. These functions take care of the boilerplate behind spawning a bunch of processes, giving them a queue of data to work through, waiting for them to finish, and gathering the results into a list; you just need to pass the data and a function to a function, and you get a list back. QuarkJets fucked around with this message at 12:44 on Dec 14, 2018 |
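A minimal sketch of that procedure - the expensive function and the rows are made-up stand-ins for the real per-row calculation and query results:

```python
import multiprocessing

def expensive(row):
    # Stand-in for the real expensive per-row calculation
    return sum(x * x for x in row)

if __name__ == "__main__":
    rows = [(1, 2), (3, 4), (5, 6)]          # pretend these came from the DB query
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(expensive, rows)  # blocks until every row is done
    print(results)  # [5, 25, 61] -- one result per input row, in order
```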
# ? Dec 14, 2018 12:33 |
|
Furism posted:I'm a decent C#/TS hobby developer and I need to get into Python for working on some automation stuff at work. Are the courses in the OP still the recommended ones for learning? I don't need to be taught what a function or variable, only any Python-specific concepts (if any) and how to get poo poo done quickly. Any recommendation? I've heard that Automate the Boring Stuff is a good book for beginners and may be right up your alley
|
# ? Dec 14, 2018 12:34 |
|
QuarkJets posted:I've heard that Automate the Boring Stuff is a good book for beginners and may be right up your alley Is that book still stuck on 2.7 or has it moved to the modern era?
|
# ? Dec 14, 2018 12:43 |
|
Proteus Jones posted:Is that book still stuck on 2.7 or has it moved to the modern era? No idea! IMO it doesn't matter to a newbie anyway, because the fundamentals are the same
|
# ? Dec 14, 2018 12:47 |
|
QuarkJets posted:No idea! IMO it doesn't matter to a newbie anyway, because the fundamentals are the same True enough
|
# ? Dec 14, 2018 12:47 |
|
Cool, will try this thanks! I'll start by automating the creation of RSA/DSA certs and move from there.
|
# ? Dec 14, 2018 12:50 |
|
Proteus Jones posted:Is that book still stuck on 2.7 or has it moved to the modern era? It has always been Python 3. You are probably thinking of Learn Python The Hard Way, from Zed "Python 3 is not Turing complete" Shaw? And saying that Python 2 vs 3 doesn't matter in tyool 2018 is just wrong.
|
# ? Dec 14, 2018 13:27 |
|
In addition to what QuarkJets said, yes it would be cleaner with concurrent.futures. It's a higher level interface.
|
# ? Dec 14, 2018 15:24 |
|
Let me know if you'd like another guide updated in OP. I'm not adding LPTHW.
|
# ? Dec 14, 2018 18:13 |
|
9-Volt Assault posted:And saying that python 2 or 3 doesnt matter in tyool 2018 is just wrong. To someone who's just trying to get a handle on how to use Python at all it doesn't really matter, the difference to them is going to be whether or not print is a function who cares
|
# ? Dec 14, 2018 18:44 |
|
|
|
Big thanks, this does give you much cleaner code, but I ran into some issues my naive implementation was (unknowingly) sidestepping. Other thing is, I failed to mention that the basic outline of my problem is: take input, run expensive_calculation(input, row) for each row in the database, and find the row that returns the highest value -- results = Pool.map(rows); max(results) is fine but seems wasteful, and I'm also hoping to short-circuit the process with a condition, like stopping once I find a score above a certain threshold, if I can understand my problem a little better -- so I might actually have to do this manually? QuarkJets posted:You can also add one more worker to your pool (I assume you have 4 cores, not 3?) because the main process is blocked anyway e: Oh, query the database in the main thread and submit each row to a concurrent.futures.ProcessPoolExecutor? Then iterate through concurrent.futures.as_completed(those_futures) to get the max. That more on the right track? ee: Yep, huge thanks. KICK BAMA KICK fucked around with this message at 05:20 on Dec 15, 2018 |
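For the record, a sketch of that ProcessPoolExecutor/as_completed approach with an early-exit threshold - score and THRESHOLD are hypothetical stand-ins for the real expensive comparison, and note that futures already running when you break will still finish (cancel() only stops ones not yet started):

```python
import concurrent.futures

THRESHOLD = 0.9  # hypothetical "good enough" score

def score(row):
    # Stand-in for the real expensive_calculation(input, row)
    return row / 100.0

def best_match(rows):
    best = float("-inf")
    with concurrent.futures.ProcessPoolExecutor() as pool:
        futures = [pool.submit(score, row) for row in rows]
        # as_completed yields results as they finish, in completion order
        for fut in concurrent.futures.as_completed(futures):
            best = max(best, fut.result())
            if best >= THRESHOLD:      # short-circuit: good enough, stop early
                for f in futures:
                    f.cancel()         # cancel anything not yet started
                break
    return best

if __name__ == "__main__":
    print(best_match(range(101)))
```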
# ? Dec 14, 2018 20:31 |