Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
I'm probably going to be the only guy to say that Twisted is a good idea and you should use it. twisted.words is a bit iffy, but it's mostly fine. Have a quick example.
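
As a rough illustration (not the example from the post; the server, channel, and nick below are made up), a bare-bones twisted.words IRC client looks something like:

Python code:
from twisted.words.protocols import irc
from twisted.internet import protocol, reactor

class EchoBot(irc.IRCClient):
    nickname = "goonbot"

    def signedOn(self):
        self.join("#example")

    def privmsg(self, user, channel, message):
        # Echo whatever anyone says back to the channel.
        self.msg(channel, message)

class EchoBotFactory(protocol.ClientFactory):
    protocol = EchoBot

    def clientConnectionLost(self, connector, reason):
        connector.connect()   # reconnect if we get dropped

reactor.connectTCP("irc.example.net", 6667, EchoBotFactory())
reactor.run()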


Haystack
Jan 23, 2005

JetsGuy posted:

Ok, here's a really dumb question. Let's say I wanted to collect all the stats from a page, for example: http://www.nhl.com/ice/playerstats.htm?season=20112012&gameType=2&team=&position=S&country=&status=&viewName=summary#?navid=nav-sts-indiv

Right now, the only thing I can think of is setting up a python script to save each page (in this case, 1 through 30), and then just having to figure out how to distill the HTML by hand. I guess "beautifulsoup" would work, but I am looking for tips as to how to do this cleanly and reliably.

I haven't personally used it, but Scrapy is supposed to be the go-to tool for that sort of thing.

jusion
Jan 24, 2007


JetsGuy posted:

Ok, here's a really dumb question. Let's say I wanted to collect all the stats from a page, for example: http://www.nhl.com/ice/playerstats.htm?season=20112012&gameType=2&team=&position=S&country=&status=&viewName=summary#?navid=nav-sts-indiv

Right now, the only thing I can think of is setting up a python script to save each page (in this case, 1 through 30), and then just having to figure out how to distill the HTML by hand. I guess "beautifulsoup" would work, but I am looking for tips as to how to do this cleanly and reliably.
Scrapy (http://doc.scrapy.org/en/0.14/intro/tutorial.html) seems like it would be something good for this.
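
As a rough sketch of what the tutorial walks you through (the paging parameter, page count, and selectors here are guesses and would need checking against the real page):

Python code:
import scrapy

class PlayerStatsSpider(scrapy.Spider):
    name = "nhl_stats"
    # Guessed URL pattern: the real site may page differently.
    start_urls = [
        "http://www.nhl.com/ice/playerstats.htm?season=20112012&gameType=2"
        "&viewName=summary&pg=%d" % page
        for page in range(1, 31)
    ]

    def parse(self, response):
        # Hypothetical selectors -- inspect the page to find the real table.
        for row in response.xpath("//table[contains(@class, 'stats')]//tr"):
            cells = row.xpath("./td//text()").extract()
            if cells:
                yield {"columns": cells}
(Newer Scrapy calls the base class scrapy.Spider; the 0.14 docs linked above spell it BaseSpider, but the shape is the same. Running it with scrapy runspider and an -o flag dumps whatever the spider yields to JSON or CSV.)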

EDIT: I'm dumb.

tef
May 30, 2004

-> some l-system crap ->

Suspicious Dish posted:

I'm probably going to be the only guy to say that Twisted is a good idea and you should use it. twisted.words is a bit iffy, but it's mostly fine. Have a quick example.

The problem I have with twisted and other asynchronous libraries of that ilk is that the primary abstractions are geared around making the reactor easier to write, rather than making your code easier to write.

Instead of modelling the state of your program, you are forced to write code around the event handlers. This often leads to each method having a cascade of if chains, spreading code that should be together across multiple event handlers. It's like you take your code and turn it inside-out.

note: PEP-342 allows you to get around this somewhat, as can be seen in bluelet: https://github.com/sampsyo/bluelet
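
To make the inversion concrete, here's a toy sketch (not from the post): the same two-step exchange written first as an event handler with hand-tracked state, then as a PEP-342 generator that a scheduler drives by .send()-ing each incoming line into it.

Python code:
# Event-handler style: one conversation smeared across if-branches.
class LoginHandler(object):
    def __init__(self, send):
        self.send = send
        self.state = "expect_user"
        self.user = None

    def on_line(self, line):
        if self.state == "expect_user":
            self.user = line
            self.send("password?")
            self.state = "expect_password"
        elif self.state == "expect_password":
            self.send("welcome, %s" % self.user)
            self.state = "done"

# PEP-342 style: the same conversation reads top to bottom; a reactor
# (or something like bluelet) resumes the generator when a line arrives.
def login_session(send):
    user = yield               # suspended until the first line arrives
    send("password?")
    _password = yield          # suspended until the next line
    send("welcome, %s" % user)
    yield                      # park so the driver's last send() doesn't raise

# Driving it by hand:
def send(msg):
    print(msg)

session = login_session(send)
next(session)                  # prime the generator up to its first yield
session.send("tef")            # prints "password?"
session.send("hunter2")        # prints "welcome, tef"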

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

tef posted:

Instead of modelling the state of your program, you are forced to write code around the event handlers. This often leads to each method having a cascade of if chains, spreading code that should be together across multiple event handlers. It's like you take your code and turn it inside-out.

Hence we have Deferreds and callback chaining, and inlineCallbacks.
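
A rough sketch of what that buys you (the URL is a placeholder, and getPage is the old-style client API, superseded by Agent/treq in newer Twisted):

Python code:
from __future__ import print_function

from twisted.internet import defer, reactor
from twisted.web.client import getPage

@defer.inlineCallbacks
def page_length(url):
    body = yield getPage(url)   # reads sequentially, never blocks the reactor
    defer.returnValue(len(body))

def report(n):
    print("got %d bytes" % n)

if __name__ == "__main__":
    d = page_length("http://example.com/")
    d.addCallback(report)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()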

JetsGuy
Sep 17, 2003

science + hockey
=
LASER SKATES

Haystack posted:

I haven't personally used it, but Scrapy is supposed to be the go-to tool for that sort of thing.


jusion posted:

Scrapy (http://doc.scrapy.org/en/0.14/intro/tutorial.html) seems like it would be something good for this.

EDIT: I'm dumb.

Holy poo poo, this looks like *exactly* the kind of thing I was looking for! Thanks a ton guys, I'll let ya know how it turns out!

lunar detritus
May 6, 2009


My local DVD rental store's site sucks (shut up, I don't have an HTPC), so I have been thinking about scraping it, adding the IMDb / Rotten Tomatoes scores to the data, and then making a (local) site that lets me order the movies by score, release date, etc. I want to use Python to do this as a learning project.

Three questions:
a) I'm guessing BeautifulSoup or Scrapy would be the thing to use for the scraping step? Should I download everything with wget before scraping? In case it matters, the site shows 24 DVDs per page (fixed), in alphabetical order (fixed) and they seem to offer 5253 movies, so 219 pages.
EDIT: Nevermind, Scrapy's spider is exactly what I needed. That's what I get for not reading the thread.

a2) What would be the best way to detect changes and update them in the local site?

b) Would Django be too much for the 'local site' part?

lunar detritus fucked around with this message at 18:54 on May 20, 2012

Captain Capacitor
Jan 21, 2008

The code you say?

JetsGuy posted:

Ok, here's a really dumb question. Let's say I wanted to collect all the stats from a page, for example: http://www.nhl.com/ice/playerstats.htm?season=20112012&gameType=2&team=&position=S&country=&status=&viewName=summary#?navid=nav-sts-indiv

Right now, the only thing I can think of is setting up a python script to save each page (in this case, 1 through 30), and then just having to figure out how to distill the HTML by hand. I guess "beautifulsoup" would work, but I am looking for tips as to how to do this cleanly and reliably.

I did something for the same game thread mentioned before and just used BeautifulSoup, but next time I think I'm going to be doing something like this.

E: Really beaten, sorry

Captain Capacitor fucked around with this message at 18:56 on May 20, 2012

Haystack
Jan 23, 2005

gmq posted:

a2) What would be the best way to detect changes and update them in the local site?

I would periodically poll the website to check if A) the total number of pages has changed and then B) if the number of items on the last page has changed. If the DVD site is as bad as you say, you'll probably have to re-scrape the entire site every time there is a change.
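
Very roughly, and with the URL and selector invented (they'd need to match the real site), that check could look like:

Python code:
import json
import urllib2              # Python 2, to match the rest of the thread
from lxml import html

CATALOG_PAGE = "http://example-rental-site/catalog?page=%d"   # made up
STATE_FILE = "catalog_state.json"

def current_state(last_page):
    # In practice you'd also scrape the pagination to find the real last page.
    doc = html.parse(urllib2.urlopen(CATALOG_PAGE % last_page)).getroot()
    items = doc.xpath("//div[@class='dvd']")                  # selector is a guess
    return {"pages": last_page, "items_on_last_page": len(items)}

def site_changed(state):
    try:
        with open(STATE_FILE) as f:
            previous = json.load(f)
    except IOError:
        previous = None
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    return state != previous

if site_changed(current_state(last_page=219)):
    print("catalogue changed -- time to re-scrape")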

gmq posted:

b) Would Django be too much for the 'local site' part?

Django should be just fine for an (introductory?) learning project. The other viable options (Pyramid, Flask) are better suited if you want a lower-level introduction to Python and web technologies.
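
For scale, the model layer for the 'local site' is only a handful of lines; the field names below are just a guess at what the scrape will produce.

Python code:
# models.py in a Django app -- field names are illustrative
from django.db import models

class Movie(models.Model):
    title = models.CharField(max_length=200)
    release_date = models.DateField(null=True, blank=True)
    imdb_score = models.FloatField(null=True, blank=True)
    tomato_score = models.IntegerField(null=True, blank=True)

    class Meta:
        ordering = ["-imdb_score"]

    def __unicode__(self):        # Django 1.x / Python 2
        return self.title
Ordering by score or release date in a view is then just Movie.objects.order_by('release_date') or similar.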

Emacs Headroom
Aug 2, 2003
Not a question, just testing the new forums code

Python code:
def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

lunar detritus
May 6, 2009


Scrapy is really cool. Finding the right xpath magic words was the hardest part of making the crawler I needed.

Now to figure out how to make a Django project and populate its database with the data.

Haystack posted:

I would periodically poll the website to check if A) the total number of pages has changed and then B) if the number of items on the last page has changed. If the DVD site is as bad as you say, you'll probably have to re-scrape the entire site every time there is a change.

It does have a 'new releases' page so checking for number of pages/items is perfect, thanks!
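
On the "populate its database" part, one low-tech route is to have the spider dump JSON (something like scrapy crawl dvdsite -o movies.json, spider name made up) and load it with a small standalone script; the settings module, app name, and JSON fields below are all placeholders.

Python code:
# load_movies.py -- run after the spider has produced movies.json
import json
import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "dvdsite.settings")  # placeholder
# Django 1.4-era setup; newer Django also wants an explicit django.setup() here.

from catalog.models import Movie       # hypothetical app name

with open("movies.json") as f:
    for item in json.load(f):
        Movie.objects.get_or_create(
            title=item["title"],
            defaults={"imdb_score": item.get("imdb_score")},
        )
Scrapy's item pipelines are the tidier option if you'd rather write into the database straight from the spider.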

tef
May 30, 2004

-> some l-system crap ->

Ridgely_Fan posted:

Not a question, just testing the new forums code

also testing new code

Python code:
print"\n".join(x.rstrip()for x in(lambda y:y["t"](y,7,1))({"t":lambda y,n,w:[
" "*w for _ in xrange(1,w)]+["__"*w]if n == 0 else ["%s%s%s"%(" "*(len(t)//2)
,t," "* (len(t)//2)) for t in y["t"](y,n-1,w)]+["%s%s"%m for m in zip(y["l"](
y,n-1,w), y["r"](y,n-1,w))], "l":lambda y,n,w:["%s/%s"%(" "*i," "*j) for (i,j
) in zip(xrange(w-1,-1,-1), xrange(w,w*2))] if n==0 else ["%s%s%s"%(" "*(len(
t)//2),t," "*(len(t)//2)) for t in y["r"](y,n-1,w)]+["%s%s"%m for m in zip(y[
"t"](y,n-1,w),y["l"](y,n-1,w))],"r":lambda y,n,w:["%s\\%s"%(" "*i," "*j)for(j
,i) in zip(xrange(w-1,-1,-1), xrange(w,w*2))] if n==0 else ["%s%s%s" %(" "* (
len(t)//2),t," "*(len(t)//2)) for t in y["l"](y,n-1,w)]+[("%s%s"%m) for m in(
zip(y["r"](y,n-1,w), y["t"](y,n-1,w)))]})) 
:swoon:

deimos
Nov 30, 2006

Forget it man this bat is whack, it's got poobrain!

tef posted:

also testing new code

Python code:
print"\n".join(x.rstrip()for x in(lambda y:y["t"](y,7,1))({"t":lambda y,n,w:[
" "*w for _ in xrange(1,w)]+["__"*w]if n == 0 else ["%s%s%s"%(" "*(len(t)//2)
,t," "* (len(t)//2)) for t in y["t"](y,n-1,w)]+["%s%s"%m for m in zip(y["l"](
y,n-1,w), y["r"](y,n-1,w))], "l":lambda y,n,w:["%s/%s"%(" "*i," "*j) for (i,j
) in zip(xrange(w-1,-1,-1), xrange(w,w*2))] if n==0 else ["%s%s%s"%(" "*(len(
t)//2),t," "*(len(t)//2)) for t in y["r"](y,n-1,w)]+["%s%s"%m for m in zip(y[
"t"](y,n-1,w),y["l"](y,n-1,w))],"r":lambda y,n,w:["%s\\%s"%(" "*i," "*j)for(j
,i) in zip(xrange(w-1,-1,-1), xrange(w,w*2))] if n==0 else ["%s%s%s" %(" "* (
len(t)//2),t," "*(len(t)//2)) for t in y["l"](y,n-1,w)]+[("%s%s"%m) for m in(
zip(y["r"](y,n-1,w), y["t"](y,n-1,w)))]})) 
:swoon:

How bad is it that the first thing I noticed about that blurb is that the operand spacing is not pep8?

Zombywuf
Mar 29, 2008

deimos posted:

How bad is it that the first thing I noticed about that blurb is that the operand spacing is not pep8?

tef never pep8s his code.

:smith:

Hughlander
May 11, 2005

gmq posted:

My local DVD rental store's site sucks (shut up, I don't have an HTPC), so I have been thinking about scraping it, adding the IMDb / Rotten Tomatoes scores to the data, and then making a (local) site that lets me order the movies by score, release date, etc. I want to use Python to do this as a learning project.

Not to destroy your idea of a python learning project, but have you also considered a javascript user script that just does an ajax call and retrieves the info as you browse and then injects it onto the page?

tef
May 30, 2004

-> some l-system crap ->

Zombywuf posted:

tef never pep8s his code.

:smith:

we hired a mug to fix up my spacing for me, problem solved

JetsGuy
Sep 17, 2003

science + hockey
=
LASER SKATES
So I have a question for those of you who have handled a lot of user input from people other than you. Generally, either I or other people on my team use my code, so I don't worry too much about doing poo poo like:

subprocess.call(command, shell=True)

where command might be something that does what I want it to, but is made to go through a listfile or something that the user passes to it. It *needs* to be run in the shell.

However, I know shell=True is dangerous. There's nothing stopping someone from doing something like ";sudo rm -r /" in a list file, and then poo poo.

So I guess my question here is about ensuring that "command" is cleaned of poo poo like that. I'm sure this is a question that could fill entire classes on how to handle user input, but what's the long and short of it? Do I just have to be very rigid about what input I accept?

lunar detritus
May 6, 2009


Hughlander posted:

Not to destroy your idea of a python learning project, but have you also considered a javascript user script that just does an ajax call and retrieves the info as you browse and then injects it onto the page?

I actually thought about it, but I want to be able to order the movies by rating or release date, since the site is ordered alphabetically.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

JetsGuy posted:

It *needs* to be run in the shell.

Why?

JetsGuy
Sep 17, 2003

science + hockey
=
LASER SKATES

Well, as best as I can tell from the documentation, shell native commands cannot be run without shell=True. Unless that is a lie (like the cake :v: )

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

JetsGuy posted:

Well, as best as I can tell from the documentation, shell native commands cannot be run without shell=True. Unless that is a lie (like the cake :v: )

What shell native commands do you think you need to run? What are you building?

Hed
Mar 31, 2004

Fun Shoe
At the very least you should be able to do something like for line in sys.stdin and pipe stuff into your script to populate lists. Then sanitize through a regex or something.
Although if you are reading in files at all there has got to be a better way than spawning a shell with full environment.
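
Something along these lines, say; the whitelist pattern is only an example and should be tightened to whatever your filenames actually look like.

Python code:
import re
import sys

# Whitelist: plain path characters only, so shell metacharacters like ';',
# '|' and '$(...)' never make it into a command string.
SAFE_NAME = re.compile(r'^[\w./-]+$')

filenames = []
for line in sys.stdin:
    name = line.strip()
    if SAFE_NAME.match(name):
        filenames.append(name)
    else:
        sys.stderr.write("rejected suspicious input: %r\n" % name)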

JetsGuy
Sep 17, 2003

science + hockey
=
LASER SKATES

Suspicious Dish posted:

What shell native commands do you think you need to run? What are you building?

As two simple examples: I'll often write the working directory to the log file with something like:

print >> logfile, subprocess.check_output("pwd", shell=True).splitlines()[0]

Or, more to the point, I sometimes write a python wrapper to make (and execute) shell scripts. For example, there are a number of astronomy analysis tools which are made to be run in a shell. To do this for thousands of files, I'll write a python script to generate the ten or so shell scripts I'd need for each file, and then run them.

Unless I am just being a newbie, I don't see any way to run a sh script from python without using shell=True.

Hed posted:

At the very least you should be able to do something like for line in sys.stdin and pipe stuff into your script to populate lists. Then sanitize through a regex or something.
Although if you are reading in files at all there has got to be a better way than spawning a shell with full environment.

Well, I mean, all my python is run from the shell to begin with, so it's not like I'm launching a new shell.

Could you clarify what you mean by sanitize through a regex?

vikingstrike
Sep 23, 2007

whats happening, captain
For grabbing the current working directory, you can use os.getcwd().

JetsGuy
Sep 17, 2003

science + hockey
=
LASER SKATES

vikingstrike posted:

For grabbing the current working directory, you can use os.getcwd().

Yeah, I had a feeling there was an easier way to grab that, thanks. I still have to run shell scripts though :/.

vikingstrike
Sep 23, 2007

whats happening, captain

JetsGuy posted:

Yeah, I had a feeling there was an easier way to grab that, thanks. I still have to run shell scripts though :/.

Yeah, I don't know of a better way to do that. I use subprocess.check_output(), too.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

JetsGuy posted:

Unless I am just being a newbie, I don't see any way to run a sh script from python without using shell=True.

You explicitly call out to /bin/bash:

code:
subprocess.Popen(['/bin/bash', 'my_script.sh'])
Though I'd recommend against this. If you're using Python as a glorified bash, just use bash, unglorified or not.

vikingstrike
Sep 23, 2007

whats happening, captain
You could always use Python to create the bash scripts, and then run them from another bash script.

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

JetsGuy posted:

Unless I am just being a newbie, I don't see any way to run a sh script from python without using shell=True.

If you're running a sh script, then the script starts its own shell. It doesn't borrow the shell it's invoked through.

JetsGuy posted:

Well, I mean, all my python is run from the shell to begin with, so it's not like I'm launching a new shell.

There is the interactive shell that you start your python script in.
Then there is a different, noninteractive shell that python invokes because you set shell=True.
Then there is a different, noninteractive shell that the shell script runs in.

If you set shell=False, then python will avoid uselessly starting that extra shell in the middle, and you can avoid some security issues.
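
Concretely (script and file names made up):

Python code:
import subprocess

filename = "obs_001.fits"     # hypothetical, e.g. read from a list file

# shell=True: the whole string goes through /bin/sh, so anything hiding in
# filename (';', '&&', '$(...)') gets interpreted by that shell.
subprocess.check_call("./reduce.sh %s" % filename, shell=True)

# List form, shell=False (the default): reduce.sh still runs under its own
# #! interpreter, but Python passes the argument through untouched, with no
# extra shell in the middle to inject into.
subprocess.check_call(["./reduce.sh", filename])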

lunar detritus
May 6, 2009


I have been going through the Google Python Class, and while I have been getting the results that the exercises require, the provided solutions have wildly different code.

For example, one of the exercises is:

quote:

# E. Given two lists sorted in increasing order, create and return a merged
# list of all the elements in sorted order. You may modify the passed in lists.
# Ideally, the solution should work in "linear" time, making a single
# pass of both lists.

This is my code:
Python code:
def linear_merge(list1, list2):
  list1.extend(list2)
  list1 = sorted(list1)
  return list1
This is google's solution:
Python code:
def linear_merge(list1, list2):
  # +++your code here+++
  # LAB(begin solution)
  result = []
  # Look at the two lists so long as both are non-empty.
  # Take whichever element [0] is smaller.
  while len(list1) and len(list2):
    if list1[0] < list2[0]:
      result.append(list1.pop(0))
    else:
      result.append(list2.pop(0))

  # Now tack on what's left
  result.extend(list1)
  result.extend(list2)
  return result
Both result in:

quote:

linear_merge
OK got: ['aa', 'bb', 'cc', 'xx', 'zz'] expected: ['aa', 'bb', 'cc', 'xx', 'zz']
OK got: ['aa', 'bb', 'cc', 'xx', 'zz'] expected: ['aa', 'bb', 'cc', 'xx', 'zz']
OK got: ['aa', 'aa', 'aa', 'bb', 'bb'] expected: ['aa', 'aa', 'aa', 'bb', 'bb']

My question is, should I have thought up Google's solution on my own or is my version okay for what was required?

EDIT: My real question is, do I suck?

lunar detritus fucked around with this message at 18:24 on May 22, 2012

leterip
Aug 25, 2004
Your solution doesn't merge them in linear time. The sorted call runs in O(n log n) time.

German Joey
Dec 18, 2004

gmq posted:

I have been going through the Google Python Class, and while I have been getting the results that the exercises require, the provided solutions have wildly different code.

For example, one of the exercises is:


This is my code:
Python code:
def linear_merge(list1, list2):
  list1.extend(list2)
  list1 = sorted(list1)
  return list1
This is google's solution:
Python code:
def linear_merge(list1, list2):
  # +++your code here+++
  # LAB(begin solution)
  result = []
  # Look at the two lists so long as both are non-empty.
  # Take whichever element [0] is smaller.
  while len(list1) and len(list2):
    if list1[0] < list2[0]:
      result.append(list1.pop(0))
    else:
      result.append(list2.pop(0))

  # Now tack on what's left
  result.extend(list1)
  result.extend(list2)
  return result
Both result in:


My question is, should I have thought up Google's solution on my own or is my version okay for what was required?

EDIT: My real question is, do I suck?

Well, although your version produces the correct result, part of the problem stipulation was: "Ideally, the solution should work in "linear" time, making a single pass of both lists." Google's solution, as you can see, does that. It is O(N). On the other hand, your solution doesn't; it merges both lists, and then calls Python's internal sorter to figure out the rest. Your runtime is thus O(N*log(N)).

Now, that's not to say your solution is objectively worse in all circumstances. Your code is more robust (it doesn't require the two input lists to be pre-sorted in order to work, for instance), is more readable, and may actually even be faster for medium-sized lists, since sorted is a single call into optimized C code rather than a loop of interpreted python instructions. So, you shouldn't feel "bad" or whatever about your code....... BUT, you're still gonna get bent over the teacher's knee and paddled in front of all of Google University for your epic failure here...

karms
Jan 22, 2006

by Nyc_Tattoo
Yam Slacker
Talk about premature optimization. Really, if your aim is to write fast code before readable code, why write it in a language that puts the emphasis the other way round?

(Yes, the Google University code is legible, but it's not as simple as gmq's.)

German Joey
Dec 18, 2004

KARMA! posted:

Talk about premature optimization. Really, if your aim is to write fast code before readable why write it in a language that puts the emphasis on working the other way round?

(Yes, the google university code is legible but it's not as simple as gmq's.)

the purpose of the class might be to teach programming in addition to teaching python, I guess? so it could be that they're trying to think more about the structure of the data? I dunno.

etcetera08
Sep 11, 2008

I think it's really just a basic case of "you probably know what merge sort is, so implement it in Python." It's just a learning exercise so I would imagine they're more concerned about using familiar algorithms than developing a new algorithm that gets best runtime in every use case...

Emacs Headroom
Aug 2, 2003
They're probably also concerned with making sure people are aware of the cost of calling functions like "sorted". If you don't realize it's actually n*log(n), you might use it in cases where you really shouldn't.

tef
May 30, 2004

-> some l-system crap ->

gmq posted:

The solutions have wildly different code.

This happens. Make sure you understand how the given solution works, and the possible differences in speed/readability that result.

quote:

This is my code:
Python code:
def linear_merge(list1, list2):
  list1.extend(list2)
  list1 = sorted(list1)
  return list1

I like this. Except you modify list1 in-place. To be more like the Google example, you can return a new list without modifying the contents of list1.

Python code:
def linear_merge(list1, list2):
    result = []
    result.extend(list1)
    result.extend(list2)
    result.sort()
    return result
This returns a new list, sorted.

quote:

My real question is, do I suck?

Nope. Using the built-ins to achieve what you need is always a good strategy. The major problem I can see is mutating list1 in place.

Despite this, people will tell you what its performance is without profiling it. Don't listen to these people, they are bad people. (also they do not know how python's sort works)

leterip posted:

The sorted call runs in O(n log n) time.

German Joey posted:

Your runtime is thus O(N*log(N)).

You guys might enjoy finding out how sort works in python :v: http://en.wikipedia.org/wiki/Timsort

Anyway, I drew a graph, with time vs lists of range(0,size)



because the lists are pre-sorted, sort() will run in O(n) time - Timsort owns - it is an adaptive merge sort - it should scan the list *once*, split it into two pre-sorted runs, and then perform a merge. You shouldn't see O(n log n) performance here.

Note: A quick google shows that this strategy breaks down around lists of > 1000000 elements http://stackoverflow.com/a/482848
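
If anyone wants to reproduce the shape of that graph, a quick-and-dirty version (timings vary by machine; heapq.merge is thrown in as a third point of comparison, it isn't part of the exercise):

Python code:
import timeit

setup = """
import heapq
a = list(range(0, 1000000, 2))   # two already-sorted inputs
b = list(range(1, 1000000, 2))
"""

print(timeit.timeit("sorted(a + b)", setup=setup, number=10))
print(timeit.timeit("list(heapq.merge(a, b))", setup=setup, number=10))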

tef fucked around with this message at 20:22 on May 22, 2012

Emacs Headroom
Aug 2, 2003
Tef, what data did you use for sorting? If you have actual random integers, timsort should still be n log(n), I think.

tef
May 30, 2004

-> some l-system crap ->

Ridgely_Fan posted:

Tef what data did you use for sorting? If you have actual random integers, timsort should still be nlog(n) I think.

For some reason I assumed the lists were sorted. I am a bad person

Edit:

Oh I was right

quote:

# E. Given two lists sorted in increasing order

tef fucked around with this message at 20:17 on May 22, 2012


Emacs Headroom
Aug 2, 2003
D'oh. You're right. Well I learned something today.
