|
I'm probably going to be the only guy to say that Twisted is a good idea and you should use it. twisted.words is a bit iffy, but it's mostly fine. Have a quick example.
|
# ? May 18, 2012 20:06 |
|
JetsGuy posted:Ok, here's a really dumb question. Let's say I wanted to collect all the stats from a page, for example: http://www.nhl.com/ice/playerstats.htm?season=20112012&gameType=2&team=&position=S&country=&status=&viewName=summary#?navid=nav-sts-indiv I haven't personally used it, but Scrapy is supposed to be the go-to tool for that sort of thing.
|
# ? May 18, 2012 20:22 |
|
JetsGuy posted:Ok, here's a really dumb question. Let's say I wanted to collect all the stats from a page, for example: http://www.nhl.com/ice/playerstats.htm?season=20112012&gameType=2&team=&position=S&country=&status=&viewName=summary#?navid=nav-sts-indiv EDIT: I'm dumb.
|
# ? May 18, 2012 20:50 |
|
Suspicious Dish posted:I'm probably going to be the only guy to say that Twisted is a good idea and you should use it. twisted.words is a bit iffy, but it's mostly fine. Have a quick example. The problem I have with Twisted and other asynchronous libraries of that ilk is that the primary abstractions are geared around making the reactor easier to write, rather than making your code easier to write. Instead of modelling the state of your program, you are forced to write code around the event handlers. This often leads to each method having a cascade of if chains, spreading code that should be together across multiple event handlers. It's like you take your code and turn it inside-out. Note: PEP 342 (generator-based coroutines) allows you to get around this somewhat, as can be seen in bluelet: https://github.com/sampsyo/bluelet
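A minimal sketch of the PEP 342 idea mentioned above, assuming Python 3 and using only plain generators (this is not bluelet's or Twisted's API). The names `session`, `run`, and the request strings are hypothetical; the point is only that the logic reads top to bottom while a driver feeds it events.

```python
# Illustrative only: a tiny driver for a PEP 342 style generator coroutine.
# The request names ("ask_name", "ask_number") are made up for this sketch.

def session():
    """Protocol logic written in one place instead of across callbacks."""
    name = yield "ask_name"        # pause until the driver sends a reply
    greeting = "hello, " + name
    answer = yield "ask_number"    # pause again for the second event
    return greeting, int(answer) * 2


def run(gen, replies):
    """Drive the generator, answering each yielded request from `replies`."""
    log = []
    try:
        request = next(gen)                    # run up to the first yield
        while True:
            log.append(request)
            request = gen.send(replies[request])
    except StopIteration as stop:
        return log, stop.value                 # generator finished


log, result = run(session(), {"ask_name": "tef", "ask_number": "21"})
print(log, result)
```

In a real event loop the driver would resume the generator when a socket becomes readable rather than looking replies up in a dict.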
|
# ? May 18, 2012 21:10 |
|
tef posted:Instead of modelling the state of your program, you are forced to write code around the event handlers. This often leads to each method having a cascade of if chains, spreading code that should be together across multiple event handlers. It's like you take your code and turn it inside-out. Hence we have Deferreds and callback chaining, and inlineCallbacks.
|
# ? May 18, 2012 21:35 |
|
Haystack posted:I haven't personally used it, but Scrapy is supposed to be the go-to tool for that sort of thing. jusion posted:Scrapy (http://doc.scrapy.org/en/0.14/intro/tutorial.html) seems like it would be something good for this. Holy poo poo, this looks like *exactly* the kind of thing I was looking for! Thanks a ton guys, I'll let ya know how it turns out!
|
# ? May 18, 2012 22:01 |
My local DVD rental store site sucks (shut up, I don't have an HTPC), so I have been thinking about scraping it, adding the IMDb / Tomato score to the data, and then making a (local) site that allows me to order it by score or release date, etc. I want to use Python to do this as a learning project. Three questions: EDIT: Nevermind, Scrapy's spider is exactly what I needed. That's what I get for not reading the thread. a2) What would be the best way to detect changes and update them in the local site? b) Would Django be too much for the 'local site' part? lunar detritus fucked around with this message at 18:54 on May 20, 2012 |
|
# ? May 20, 2012 17:49 |
|
JetsGuy posted:Ok, here's a really dumb question. Let's say I wanted to collect all the stats from a page, for example: http://www.nhl.com/ice/playerstats.htm?season=20112012&gameType=2&team=&position=S&country=&status=&viewName=summary#?navid=nav-sts-indiv I did something for the same game thread mentioned before and just used BeautifulSoup, but next time I think I'm going to be doing something like this. E: Really beaten, sorry Captain Capacitor fucked around with this message at 18:56 on May 20, 2012 |
# ? May 20, 2012 18:36 |
|
gmq posted:a2) What would be the best way to detect changes and update them in the local site? I would periodically poll the website to check A) whether the total number of pages has changed and then B) whether the number of items on the last page has changed. If the DVD site is as bad as you say, you'll probably have to re-scrape the entire site every time there is a change. gmq posted:b) Would Django be too much for the 'local site' part? Django should be just fine for an (introductory?) learning project. The other viable options (Pyramid, Flask) are better suited if you want a lower-level introduction to Python and web technologies.
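The polling check described above can be sketched roughly as follows, assuming the scraper hands back a list of pages, each a list of item titles. `snapshot` and `site_changed` are hypothetical helper names, not part of Scrapy or Django.

```python
def snapshot(pages):
    """Summarise a crawl as (number of pages, items on the last page).

    `pages` is assumed to be a list of lists of item titles, one inner
    list per listing page.
    """
    if not pages:
        return (0, 0)
    return (len(pages), len(pages[-1]))


def site_changed(previous, current):
    """True when there is no earlier snapshot or the two snapshots differ."""
    return previous is None or previous != current


old = snapshot([["Alien", "Brazil"], ["Casablanca"]])
new = snapshot([["Alien", "Brazil"], ["Casablanca", "Dune"]])
print(site_changed(old, new))
```

Note the limitation: if one title is swapped for another and the counts stay the same, this check sees no change, so an occasional full re-scrape is still worth doing.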
|
# ? May 20, 2012 19:06 |
|
Not a question, just testing the new forums codePython code:
|
# ? May 21, 2012 04:37 |
Scrapy is really cool. Finding the right xpath magic words was the hardest part of making the crawler I needed. Now to figure out how to make a Django project and populate its database with the data. Haystack posted:I would periodically poll the website to check if A) the total number of pages has changed and then B) if the number of items on the last page has changed. If the DVD site is as bad as you say, you'll probably have to re-scrape the entire site every time there is a change. It does have a 'new releases' page so checking for number of pages/items is perfect, thanks!
|
|
# ? May 21, 2012 05:36 |
|
Ridgely_Fan posted:Not a question, just testing the new forums code also testing new code Python code:
|
# ? May 21, 2012 06:52 |
|
tef posted:also testing new code How bad is it that the first thing I noticed about that blurb is the operand spacing is not pep8?
|
# ? May 21, 2012 15:30 |
|
deimos posted:How bad is it that the first thing I noticed about that blurb is the operand spacing is not pep8? tef never pep8s his code.
|
# ? May 21, 2012 17:33 |
|
gmq posted:My local DVD rental store site sucks (shut up, I don't have an HTPC), so I have been thinking about scraping it, add the IMdb / Tomato score to the data and then make a (local) site that allows me to order it by score or release date, etc. I want to use Python to do this as a learning project. Not to destroy your idea of a python learning project, but have you also considered a javascript user script that just does an ajax call and retrieves the info as you browse and then injects it onto the page?
|
# ? May 21, 2012 18:02 |
|
Zombywuf posted:tef never pep8s his code. we hired a mug to fix up my spacing for me, problem solved
|
# ? May 21, 2012 18:05 |
|
So I have a question for those of you who have handled a lot of user input from people other than you. Generally, either I or other people on my team use my code, so I don't worry too much about doing poo poo like: subprocess.call(command, shell=True) where command might be something that does what I want it to, but is built by going through a list file or something else that the user passes in. It *needs* to be run in the shell. However, I know shell=True is dangerous. There's nothing stopping someone from putting something like ";sudo rm -r /" in a list file, and then poo poo. So I guess my question here is about ensuring that "command" is cleaned of poo poo like that. I'm sure this is a question that could fill entire classes on how to handle user input, but what's the long and short of it? Do I just have to be very rigid about what input I accept?
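One common answer, sketched here under the assumption of Python 3.7+, is to avoid building a command string at all: pass an argument list and leave shell=False (the default), so no shell ever parses the user's text. The `hostile` value is a made-up example of a bad list-file line, and the child process is just the Python interpreter echoing its argument back.

```python
import shlex
import subprocess
import sys

# Hypothetical hostile line from a user-supplied list file.
hostile = "data.fits; echo INJECTED"

# With an argument list and shell=False, the whole string reaches the child
# as a single argv entry; the ';' is never interpreted as a separator.
out = subprocess.check_output(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile],
    text=True,
)
print(out.strip())

# If a shell really is unavoidable, shlex.quote() makes one value safe to
# embed in a POSIX shell command line.
quoted = shlex.quote(hostile)
print(quoted)
```

Quoting is the fallback, not the first choice; the list form sidesteps the problem entirely.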
|
# ? May 21, 2012 18:44 |
Hughlander posted:Not to destroy your idea of a python learning project, but have you also considered a javascript user script that just does an ajax call and retrieves the info as you browse and then injects it onto the page? I actually thought about it but I want to be able to order the movies by rating or release date as the site is ordered alphabetically.
|
|
# ? May 21, 2012 19:34 |
|
JetsGuy posted:It *needs* to be run in the shell. Why?
|
# ? May 21, 2012 20:29 |
|
Well, as best as I can tell from the documentation, shell native commands cannot be run without shell=True. Unless that is a lie (like the cake)
|
# ? May 22, 2012 03:31 |
|
JetsGuy posted:Well, as best as I can tell from the documentation, shell native commands cannot be run without shell=True. Unless that is a lie (like the cake ) What shell native commands do you think you need to run? What are you building?
|
# ? May 22, 2012 04:00 |
|
At the very least you should be able to do something like for line in sys.stdin and pipe stuff into your script to populate lists. Then sanitize through a regex or something. Although if you are reading in files at all there has got to be a better way than spawning a shell with full environment.
|
# ? May 22, 2012 04:57 |
|
Suspicious Dish posted:What shell native commands do you think you need to run? What are you building? As two simple examples, I'll often write the working directory to the log file with something like: print >> logfile, subprocess.check_output("pwd", shell=True).splitlines()[0] Or, more to the point, I sometimes will write a Python wrapper to make (and execute) shell scripts. For example, there are a number of astronomy analysis tools which are made to be run in a shell. To do this for thousands of files, I'll write a Python script to write the ten or so shell scripts I'd need for each file, and then run them. Unless I am just being a newbie, I don't see any way to run a sh script from Python without using shell=True. Hed posted:At the very least you should be able to do something like for line in sys.stdin and pipe stuff into your script to populate lists. Then sanitize through a regex or something. Well, I mean, all my Python is run from the shell to begin with, so it's not like I'm launching a new shell to begin with. Could you clarify what you mean by sanitize through a regex?
|
# ? May 22, 2012 16:46 |
|
For grabbing the current working directory, you can use os.getcwd().
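As a rough illustration in Python 3 syntax (the thread's snippets use the Python 2 `print >>` form), logging the working directory needs no subprocess at all. The `logfile` here is an in-memory stand-in for a real file opened with open().

```python
import io
import os

# Stand-in for a real log file; any object with write() behaves the same.
logfile = io.StringIO()

# os.getcwd() returns the current working directory as a string, so there
# is no child process to spawn and no output to split.
print(os.getcwd(), file=logfile)

print(logfile.getvalue().strip())
```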
|
# ? May 22, 2012 17:05 |
|
vikingstrike posted:For grabbing the current working directory, you can use os.getcwd(). Yeah, I had a feeling there was an easier way to grab that, thanks. I still have to run shell scripts though :/.
|
# ? May 22, 2012 17:12 |
|
JetsGuy posted:Yeah, I had a feeling there was an easier way to grab that, thanks. I still have to run shell scripts though :/. Yeah, I don't know of a better way to do that. I use subprocess.check_output(), too.
|
# ? May 22, 2012 17:30 |
|
JetsGuy posted:Unless I am just being a newbie, I don't see anyway to run a sh script from python without using shell=True. You explicitly call out to /bin/bash: code:
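A hedged sketch of what calling the interpreter explicitly can look like, assuming a POSIX system with /bin/sh (use /bin/bash instead if the script needs bash features). The temporary script and the file name "obs 001.fits" are invented for the example.

```python
import os
import subprocess
import tempfile

# Write a throwaway script; in real use this would be one of the generated
# analysis scripts.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as fh:
    fh.write('echo "processing $1"\n')
    script_path = fh.name

try:
    # The interpreter is named explicitly, so shell=True is not needed, and
    # the file name (spaces included) arrives as a single argument.
    output = subprocess.check_output(
        ["/bin/sh", script_path, "obs 001.fits"], text=True
    )
finally:
    os.remove(script_path)

print(output.strip())
```

The script still runs in its own shell; what goes away is the extra shell that shell=True would start just to parse the command line.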
|
# ? May 22, 2012 17:51 |
|
You could always use Python to create the bash scripts, and then run them from another bash script.
|
# ? May 22, 2012 18:11 |
|
JetsGuy posted:Unless I am just being a newbie, I don't see anyway to run a sh script from python without using shell=True. If you're running a sh script, then the script starts its own shell. It doesn't borrow the shell it's invoked through. JetsGuy posted:Well, I mean, all my python is run from the shell to begin with, so it's not like I'm launching a new shell to begin with. There is the interactive shell that you start your python script in. Then there is a different, noninteractive shell that python invokes because you set shell=True. Then there is a different, noninteractive shell that the shell script runs in. If you set shell=False, then python will avoid uselessly starting that extra shell in the middle, and you can avoid some security issues.
|
# ? May 22, 2012 18:20 |
I have been going through Google Python Class and while I have been getting the results that the exercises require, the solutions have wildly different code. For example, one of the exercises is: quote:# E. Given two lists sorted in increasing order, create and return a merged This is my code: Python code:
Python code:
quote:linear_merge My question is, should I have thought up Google's solution on my own or is my version okay for what was required? EDIT: My real question is, do I suck? lunar detritus fucked around with this message at 18:24 on May 22, 2012 |
|
# ? May 22, 2012 18:21 |
|
Your solution doesn't merge them in linear time. The sorted call runs in O(n log n) time.
|
# ? May 22, 2012 18:30 |
|
gmq posted:I have been going through Google Python Class and while I have been getting the results that the exercises require, the solutions have wildly different code. Well, although your version produces the correct result, part of the problem stipulation was: "Ideally, the solution should work in "linear" time, making a single pass of both lists." Google's solution, as you can see, does that. It is O(N). On the other hand, your solution doesn't; it concatenates both lists, and then calls Python's internal sorter to figure out the rest. Your runtime is thus O(N*log(N)). Now, that's not to say your solution is objectively worse in all circumstances. Your code is more robust (it doesn't have the input requirement that the two lists be pre-sorted for it to work, for instance), is more readable, and may actually even be faster for medium-sized lists, as "sorted" is a single, optimized built-in call implemented in C rather than a series of individually interpreted Python instructions. So, you shouldn't feel "bad" or whatever about your code....... BUT, you're still gonna get bent over the teacher's knee and paddled in front of all of Google University for your epic failure here...
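For comparison, here is one way the two approaches under discussion might look. Neither is the code actually posted in the thread or Google's reference solution; `merge_with_sorted` and the body of `linear_merge` are a sketch written for illustration, and both return a new list without touching their inputs.

```python
def merge_with_sorted(list1, list2):
    """Concatenate into a new list and let the built-in sort do the work."""
    return sorted(list1 + list2)


def linear_merge(list1, list2):
    """Merge two already-sorted lists in a single pass over both."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller head element of the two lists.
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    # At most one of these slices is non-empty; append whatever is left.
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged


a = ["aa", "xx", "zz"]
b = ["bb", "cc"]
print(linear_merge(a, b))
print(merge_with_sorted(a, b))
```

The single-pass version depends on both inputs being sorted; the sorted() version does not.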
|
# ? May 22, 2012 18:38 |
|
Talk about premature optimization. Really, if your aim is to write fast code before readable why write it in a language that puts the emphasis on working the other way round? (Yes, the google university code is legible but it's not as simple as gmq's.)
|
# ? May 22, 2012 18:45 |
|
KARMA! posted:Talk about premature optimization. Really, if your aim is to write fast code before readable why write it in a language that puts the emphasis on working the other way round? the purpose of the class might be to teach programming in addition to teaching python, I guess? so it could be that they're trying to think more about the structure of the data? I dunno.
|
# ? May 22, 2012 19:07 |
|
I think it's really just a basic case of "you probably know what merge sort is, so implement its merge step in Python." It's just a learning exercise so I would imagine they're more concerned about using familiar algorithms than developing a new algorithm that gets best runtime in every use case...
|
# ? May 22, 2012 19:09 |
|
They're probably also concerned with making sure people are aware of the cost of calling functions like "sorted". If you don't realize it's actually n*log(n), you might use it in cases where you really shouldn't.
|
# ? May 22, 2012 19:54 |
|
gmq posted:The solutions have wildly different code. This happens. Make sure you understand how the given solution works, and the possible differences in speed/readability that result. quote:This is my code: I like this. Except you modify list1 in-place. To be more like the google example, you can return a new list without modifying the contents of list1. Python code:
quote:My real question is, do I suck? Nope. Using the built-ins to achieve what you need is always a good strategy. The major problem I can see is mutating list1 in place. Despite this, people will tell you how it performs without profiling it. Don't listen to these people, they are bad people. (Also, they do not know how Python's sort works.) leterip posted:The sorted call runs in O(n logn) time. German Joey posted:Your runtime is thus O(N*log(N)). You guys might enjoy finding out how sort works in Python: http://en.wikipedia.org/wiki/Timsort Anyway, I drew a graph of time vs. lists of range(0, size). Because the lists are pre-sorted, sort() will run in O(n) time. Timsort owns: it is an adaptive merge sort, so it should scan the list *once*, split the list into two pre-sorted chunks and then perform a merge. You shouldn't see O(n*logn) performance here. Note: A quick google shows that this strategy breaks down around lists of > 1000000 elements http://stackoverflow.com/a/482848 tef fucked around with this message at 20:22 on May 22, 2012 |
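The claim about pre-sorted input can be checked without a stopwatch by counting comparisons instead of measuring time. This is a sketch, not the benchmark behind the graph: `Counted` and `count_comparisons` are hypothetical helpers, the lists are interleaved evens and odds so a real merge is required, and the exact counts depend on the CPython version.

```python
import random


class Counted:
    """Wraps a value and counts every '<' comparison made on instances."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def count_comparisons(values):
    """Sort wrapped copies of `values`; return how many comparisons it took."""
    Counted.comparisons = 0
    sorted(Counted(v) for v in values)
    return Counted.comparisons


n = 1000
evens = list(range(0, 2 * n, 2))
odds = list(range(1, 2 * n, 2))

# Two sorted runs back to back, as in sorted(list1 + list2).
presorted = count_comparisons(evens + odds)

# The same values in random order, for contrast.
shuffled_values = evens + odds
random.Random(0).shuffle(shuffled_values)
shuffled = count_comparisons(shuffled_values)

print(presorted, shuffled)
```

On two concatenated sorted runs the comparison count grows roughly in proportion to the list length, while the shuffled case needs several times more.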
# ? May 22, 2012 20:05 |
|
Tef what data did you use for sorting? If you have actual random integers, timsort should still be nlog(n) I think.
|
# ? May 22, 2012 20:11 |
|
Ridgely_Fan posted:Tef what data did you use for sorting? If you have actual random integers, timsort should still be nlog(n) I think. For some reason I assumed the lists were sorted. I am a bad person Edit: Oh I was right quote:# E. Given two lists sorted in increasing order tef fucked around with this message at 20:17 on May 22, 2012 |
# ? May 22, 2012 20:13 |
|
D'oh. You're right. Well I learned something today.
|
# ? May 22, 2012 20:14 |