Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
I'm probably going to be the only guy to say that Twisted is a good idea and you should use it. twisted.words is a bit iffy, but it's mostly fine. Have a quick example.
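
As a rough illustration (not the example from the post; the server, channel, and nick below are made up), a bare-bones twisted.words IRC client looks something like:

Python code:
from twisted.words.protocols import irc
from twisted.internet import protocol, reactor

class EchoBot(irc.IRCClient):
    nickname = "goonbot"

    def signedOn(self):
        self.join("#example")

    def privmsg(self, user, channel, message):
        # Echo whatever anyone says back to the channel.
        self.msg(channel, message)

class EchoBotFactory(protocol.ClientFactory):
    protocol = EchoBot

    def clientConnectionLost(self, connector, reason):
        connector.connect()   # reconnect if we get dropped

reactor.connectTCP("irc.example.net", 6667, EchoBotFactory())
reactor.run()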


Haystack
Jan 23, 2005

JetsGuy posted:

Ok, here's a really dumb question. Let's say I wanted to collect all the stats from a page, for example: http://www.nhl.com/ice/playerstats.htm?season=20112012&gameType=2&team=&position=S&country=&status=&viewName=summary#?navid=nav-sts-indiv

Right now, the only thing I can think of is setting up a python script to save each page (in this case, 1 through 30), and then just having to figure out how to distill the HTML by hand. I guess "beautifulsoup" would work, but I am looking for tips as to how to do this cleanly and reliably.

I haven't personally used it, but Scrapy is supposed to be the go-to tool for that sort of thing.

jusion
Jan 24, 2007


JetsGuy posted:

Ok, here's a really dumb question. Let's say I wanted to collect all the stats from a page, for example: http://www.nhl.com/ice/playerstats.htm?season=20112012&gameType=2&team=&position=S&country=&status=&viewName=summary#?navid=nav-sts-indiv

Right now, the only thing I can think of is setting up a python script to save each page (in this case, 1 through 30), and then just having to figure out how to distill the HTML by hand. I guess "beautifulsoup" would work, but I am looking for tips as to how to do this cleanly and reliably.
Scrapy (http://doc.scrapy.org/en/0.14/intro/tutorial.html) seems like it would be something good for this.
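
As a rough sketch of what the tutorial walks you through (the paging parameter, page count, and selectors here are guesses and would need checking against the real page):

Python code:
import scrapy

class PlayerStatsSpider(scrapy.Spider):
    name = "nhl_stats"
    # Guessed URL pattern: the real site may page differently.
    start_urls = [
        "http://www.nhl.com/ice/playerstats.htm?season=20112012&gameType=2"
        "&viewName=summary&pg=%d" % page
        for page in range(1, 31)
    ]

    def parse(self, response):
        # Hypothetical selectors -- inspect the page to find the real table.
        for row in response.xpath("//table[contains(@class, 'stats')]//tr"):
            cells = row.xpath("./td//text()").extract()
            if cells:
                yield {"columns": cells}
(Newer Scrapy calls the base class scrapy.Spider; the 0.14 docs linked above spell it BaseSpider, but the shape is the same. Running it with scrapy runspider and an -o flag dumps whatever the spider yields to JSON or CSV.)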

EDIT: I'm dumb.

tef
May 30, 2004

-> some l-system crap ->

Suspicious Dish posted:

I'm probably going to be the only guy to say that Twisted is a good idea and you should use it. twisted.words is a bit iffy, but it's mostly fine. Have a quick example.

The problem I have with twisted and other asynchronous libraries of that ilk is that the primary abstractions are geared around making the reactor easier to write, rather than making your code easier to write.

Instead of modelling the state of your program, you are forced to write code around the event handlers. This often leads to each method having a cascade of if chains, spreading code that should be together across multiple event handlers. It's like you take your code and turn it inside-out.

note: PEP-342 allows you to get around this somewhat, as can be seen in bluelet: https://github.com/sampsyo/bluelet
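
To make the inversion concrete, here's a toy sketch (not from the post): the same two-step exchange written first as an event handler with hand-tracked state, then as a PEP-342 generator that a scheduler drives by .send()-ing each incoming line into it.

Python code:
# Event-handler style: one conversation smeared across if-branches.
class LoginHandler(object):
    def __init__(self, send):
        self.send = send
        self.state = "expect_user"
        self.user = None

    def on_line(self, line):
        if self.state == "expect_user":
            self.user = line
            self.send("password?")
            self.state = "expect_password"
        elif self.state == "expect_password":
            self.send("welcome, %s" % self.user)
            self.state = "done"

# PEP-342 style: the same conversation reads top to bottom; a reactor
# (or something like bluelet) resumes the generator when a line arrives.
def login_session(send):
    user = yield               # suspended until the first line arrives
    send("password?")
    _password = yield          # suspended until the next line
    send("welcome, %s" % user)
    yield                      # park so the driver's last send() doesn't raise

# Driving it by hand:
def send(msg):
    print(msg)

session = login_session(send)
next(session)                  # prime the generator up to its first yield
session.send("tef")            # prints "password?"
session.send("hunter2")        # prints "welcome, tef"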

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

tef posted:

Instead of modelling the state of your program, you are forced to write code around the event handlers. This often leads to each method having a cascade of if chains, spreading code that should be together across multiple event handlers. It's like you take your code and turn it inside-out.

Hence we have Deferreds and callback chaining, and inlineCallbacks.
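
A rough sketch of what that buys you (the URL is a placeholder, and getPage is the old-style client API, superseded by Agent/treq in newer Twisted):

Python code:
from __future__ import print_function

from twisted.internet import defer, reactor
from twisted.web.client import getPage

@defer.inlineCallbacks
def page_length(url):
    body = yield getPage(url)   # reads sequentially, never blocks the reactor
    defer.returnValue(len(body))

def report(n):
    print("got %d bytes" % n)

if __name__ == "__main__":
    d = page_length("http://example.com/")
    d.addCallback(report)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()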

JetsGuy
Sep 17, 2003

science + hockey
=
LASER SKATES

Haystack posted:

I haven't personally used it, but Scrapy is supposed to be the go-to tool for that sort of thing.


jusion posted:

Scrapy (http://doc.scrapy.org/en/0.14/intro/tutorial.html) seems like it would be something good for this.

EDIT: I'm dumb.

Holy poo poo, this looks like *exactly* the kind of thing I was looking for! Thanks a ton guys, I'll let ya know how it turns out!

lunar detritus
May 6, 2009


My local DVD rental store's site sucks (shut up, I don't have an HTPC), so I have been thinking about scraping it, adding the IMDb / Rotten Tomatoes scores to the data, and then making a (local) site that lets me order the movies by score, release date, etc. I want to use Python to do this as a learning project.

Three questions:
a) I'm guessing BeautifulSoup or Scrapy would be the thing to use for the scraping step? Should I download everything with wget before scraping? In case it matters, the site shows 24 DVDs per page (fixed), in alphabetical order (fixed) and they seem to offer 5253 movies, so 219 pages.
EDIT: Nevermind, Scrapy's spider is exactly what I needed. That's what I get for not reading the thread.

a2) What would be the best way to detect changes and update them in the local site?

b) Would Django be too much for the 'local site' part?

lunar detritus fucked around with this message at 18:54 on May 20, 2012

Captain Capacitor
Jan 21, 2008

The code you say?

JetsGuy posted:

Ok, here's a really dumb question. Let's say I wanted to collect all the stats from a page, for example: http://www.nhl.com/ice/playerstats.htm?season=20112012&gameType=2&team=&position=S&country=&status=&viewName=summary#?navid=nav-sts-indiv

Right now, the only thing I can think of is setting up a python script to save each page (in this case, 1 through 30), and then just having to figure out how to distill the HTML by hand. I guess "beautifulsoup" would work, but I am looking for tips as to how to do this cleanly and reliably.

I did something for the same game thread mentioned before and just used BeautifulSoup, but next time I think I'm going to be doing something like this.

E: Really beaten, sorry

Captain Capacitor fucked around with this message at 18:56 on May 20, 2012

Haystack
Jan 23, 2005

gmq posted:

a2) What would be the best way to detect changes and update them in the local site?

I would periodically poll the website to check if A) the total number of pages has changed and then B) if the number of items on the last page has changed. If the DVD site is as bad as you say, you'll probably have to re-scrape the entire site every time there is a change.
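
Very roughly, and with the URL and selector invented (they'd need to match the real site), that check could look like:

Python code:
import json
import urllib2              # Python 2, to match the rest of the thread
from lxml import html

CATALOG_PAGE = "http://example-rental-site/catalog?page=%d"   # made up
STATE_FILE = "catalog_state.json"

def current_state(last_page):
    # In practice you'd also scrape the pagination to find the real last page.
    doc = html.parse(urllib2.urlopen(CATALOG_PAGE % last_page)).getroot()
    items = doc.xpath("//div[@class='dvd']")                  # selector is a guess
    return {"pages": last_page, "items_on_last_page": len(items)}

def site_changed(state):
    try:
        with open(STATE_FILE) as f:
            previous = json.load(f)
    except IOError:
        previous = None
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    return state != previous

if site_changed(current_state(last_page=219)):
    print("catalogue changed -- time to re-scrape")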

gmq posted:

b) Would Django be too much for the 'local site' part?

Django should be just fine for an (introductory?) learning project. The other viable options (Pyramid, Flask) are better suited if you want a lower-level introduction to Python and web technologies.
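
For scale, the model layer for the 'local site' is only a handful of lines; the field names below are just a guess at what the scrape will produce.

Python code:
# models.py in a Django app -- field names are illustrative
from django.db import models

class Movie(models.Model):
    title = models.CharField(max_length=200)
    release_date = models.DateField(null=True, blank=True)
    imdb_score = models.FloatField(null=True, blank=True)
    tomato_score = models.IntegerField(null=True, blank=True)

    class Meta:
        ordering = ["-imdb_score"]

    def __unicode__(self):        # Django 1.x / Python 2
        return self.title
Ordering by score or release date in a view is then just Movie.objects.order_by('release_date') or similar.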

Emacs Headroom
Aug 2, 2003
Not a question, just testing the new forums code

Python code:
def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

lunar detritus
May 6, 2009


Scrapy is really cool. Finding the right xpath magic words was the hardest part of making the crawler I needed.

Now to figure out how to make a Django project and populate its database with the data.

Haystack posted:

I would periodically poll the website to check if A) the total number of pages has changed and then B) if the number of items on the last page has changed. If the DVD site is as bad as you say, you'll probably have to re-scrape the entire site every time there is a change.

It does have a 'new releases' page so checking for number of pages/items is perfect, thanks!
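
On the "populate its database" part, one low-tech route is to have the spider dump JSON (something like scrapy crawl dvdsite -o movies.json, spider name made up) and load it with a small standalone script; the settings module, app name, and JSON fields below are all placeholders.

Python code:
# load_movies.py -- run after the spider has produced movies.json
import json
import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "dvdsite.settings")  # placeholder
# Django 1.4-era setup; newer Django also wants an explicit django.setup() here.

from catalog.models import Movie       # hypothetical app name

with open("movies.json") as f:
    for item in json.load(f):
        Movie.objects.get_or_create(
            title=item["title"],
            defaults={"imdb_score": item.get("imdb_score")},
        )
Scrapy's item pipelines are the tidier option if you'd rather write into the database straight from the spider.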

tef
May 30, 2004

-> some l-system crap ->

Ridgely_Fan posted:

Not a question, just testing the new forums code

also testing new code

Python code:
print"\n".join(x.rstrip()for x in(lambda y:y["t"](y,7,1))({"t":lambda y,n,w:[
" "*w for _ in xrange(1,w)]+["__"*w]if n == 0 else ["%s%s%s"%(" "*(len(t)//2)
,t," "* (len(t)//2)) for t in y["t"](y,n-1,w)]+["%s%s"%m for m in zip(y["l"](
y,n-1,w), y["r"](y,n-1,w))], "l":lambda y,n,w:["%s/%s"%(" "*i," "*j) for (i,j
) in zip(xrange(w-1,-1,-1), xrange(w,w*2))] if n==0 else ["%s%s%s"%(" "*(len(
t)//2),t," "*(len(t)//2)) for t in y["r"](y,n-1,w)]+["%s%s"%m for m in zip(y[
"t"](y,n-1,w),y["l"](y,n-1,w))],"r":lambda y,n,w:["%s\\%s"%(" "*i," "*j)for(j
,i) in zip(xrange(w-1,-1,-1), xrange(w,w*2))] if n==0 else ["%s%s%s" %(" "* (
len(t)//2),t," "*(len(t)//2)) for t in y["l"](y,n-1,w)]+[("%s%s"%m) for m in(
zip(y["r"](y,n-1,w), y["t"](y,n-1,w)))]})) 
:swoon:

deimos
Nov 30, 2006

Forget it man this bat is whack, it's got poobrain!

tef posted:

also testing new code

Python code:
print"\n".join(x.rstrip()for x in(lambda y:y["t"](y,7,1))({"t":lambda y,n,w:[
" "*w for _ in xrange(1,w)]+["__"*w]if n == 0 else ["%s%s%s"%(" "*(len(t)//2)
,t," "* (len(t)//2)) for t in y["t"](y,n-1,w)]+["%s%s"%m for m in zip(y["l"](
y,n-1,w), y["r"](y,n-1,w))], "l":lambda y,n,w:["%s/%s"%(" "*i," "*j) for (i,j
) in zip(xrange(w-1,-1,-1), xrange(w,w*2))] if n==0 else ["%s%s%s"%(" "*(len(
t)//2),t," "*(len(t)//2)) for t in y["r"](y,n-1,w)]+["%s%s"%m for m in zip(y[
"t"](y,n-1,w),y["l"](y,n-1,w))],"r":lambda y,n,w:["%s\\%s"%(" "*i," "*j)for(j
,i) in zip(xrange(w-1,-1,-1), xrange(w,w*2))] if n==0 else ["%s%s%s" %(" "* (
len(t)//2),t," "*(len(t)//2)) for t in y["l"](y,n-1,w)]+[("%s%s"%m) for m in(
zip(y["r"](y,n-1,w), y["t"](y,n-1,w)))]})) 
:swoon:

How bad is it that the first thing I noticed about that blurb is that the operand spacing is not pep8?

Zombywuf
Mar 29, 2008

deimos posted:

How bad is it that the first thing I noticed about that blurb is that the operand spacing is not pep8?

tef never pep8s his code.

:smith:

Hughlander
May 11, 2005

gmq posted:

My local DVD rental store's site sucks (shut up, I don't have an HTPC), so I have been thinking about scraping it, adding the IMDb / Rotten Tomatoes scores to the data, and then making a (local) site that lets me order the movies by score, release date, etc. I want to use Python to do this as a learning project.

Not to destroy your idea of a python learning project, but have you also considered a javascript user script that just does an ajax call and retrieves the info as you browse and then injects it onto the page?

tef
May 30, 2004

-> some l-system crap ->

Zombywuf posted:

tef never pep8s his code.

:smith:

we hired a mug to fix up my spacing for me, problem solved

JetsGuy
Sep 17, 2003

science + hockey
=
LASER SKATES
So I have a question for those of you who have handled a lot of user input from people other than you. Generally, either I or other people on my team use my code, so I don't worry too much about doing poo poo like:

subprocess.call(command, shell=True)

where command might be something that does what I want it to, but is made to go through a listfile or something that the user passes to it. It *needs* to be run in the shell.

However, I know shell=True is dangerous. There's nothing stopping someone from doing something like ";sudo rm -r /" in a list file, and then poo poo.

So I guess my question here is about ensuring that "command" is cleaned of poo poo like that. I'm sure this is a question that could fill entire classes on how to handle user input, but what's the long and short of it? Do I just have to be very rigid about what input I accept?

lunar detritus
May 6, 2009


Hughlander posted:

Not to destroy your idea of a python learning project, but have you also considered a javascript user script that just does an ajax call and retrieves the info as you browse and then injects it onto the page?

I actually thought about it, but I want to be able to order the movies by rating or release date, since the site is ordered alphabetically.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

JetsGuy posted:

It *needs* to be run in the shell.

Why?

JetsGuy
Sep 17, 2003

science + hockey
=
LASER SKATES

Well, as best as I can tell from the documentation, shell native commands cannot be run without shell=True. Unless that is a lie (like the cake :v: )

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

JetsGuy posted:

Well, as best as I can tell from the documentation, shell native commands cannot be run without shell=True. Unless that is a lie (like the cake :v: )

What shell native commands do you think you need to run? What are you building?

Hed
Mar 31, 2004

Fun Shoe
At the very least you should be able to do something like for line in sys.stdin and pipe stuff into your script to populate lists. Then sanitize through a regex or something.
Although if you are reading in files at all there has got to be a better way than spawning a shell with full environment.
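
Something along these lines, say; the whitelist pattern is only an example and should be tightened to whatever your filenames actually look like.

Python code:
import re
import sys

# Whitelist: plain path characters only, so shell metacharacters like ';',
# '|' and '$(...)' never make it into a command string.
SAFE_NAME = re.compile(r'^[\w./-]+$')

filenames = []
for line in sys.stdin:
    name = line.strip()
    if SAFE_NAME.match(name):
        filenames.append(name)
    else:
        sys.stderr.write("rejected suspicious input: %r\n" % name)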

JetsGuy
Sep 17, 2003

science + hockey
=
LASER SKATES

Suspicious Dish posted:

What shell native commands do you think you need to run? What are you building?

As two simple examples: I'll often write the working directory to the log file with something like:

print >> logfile, subprocess.check_output("pwd", shell=True).splitlines()[0]

Or, more to the point, I sometimes write a python wrapper to make (and execute) shell scripts. For example, there are a number of astronomy analysis tools which are made to be run in a shell. To do this for thousands of files, I'll write a python script to generate the ten or so shell scripts I'd need for each file, and then run them.

Unless I am just being a newbie, I don't see any way to run a sh script from python without using shell=True.

Hed posted:

At the very least you should be able to do something like for line in sys.stdin and pipe stuff into your script to populate lists. Then sanitize through a regex or something.
Although if you are reading in files at all there has got to be a better way than spawning a shell with full environment.

Well, I mean, all my python is run from the shell to begin with, so it's not like I'm launching a new shell.

Could you clarify what you mean by sanitize through a regex?

vikingstrike
Sep 23, 2007

whats happening, captain
For grabbing the current working directory, you can use os.getcwd().

JetsGuy
Sep 17, 2003

science + hockey
=
LASER SKATES

vikingstrike posted:

For grabbing the current working directory, you can use os.getcwd().

Yeah, I had a feeling there was an easier way to grab that, thanks. I still have to run shell scripts though :/.

vikingstrike
Sep 23, 2007

whats happening, captain

JetsGuy posted:

Yeah, I had a feeling there was an easier way to grab that, thanks. I still have to run shell scripts though :/.

Yeah, I don't know of a better way to do that. I use subprocess.check_output(), too.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

JetsGuy posted:

Unless I am just being a newbie, I don't see any way to run a sh script from python without using shell=True.

You explicitly call out to /bin/bash:

code:
subprocess.Popen(['/bin/bash', 'my_script.sh'])
Though I'd recommend against this. If you're using Python as a glorified bash, just use bash, unglorified or not.

vikingstrike
Sep 23, 2007

whats happening, captain
You could always use Python to create the bash scripts, and then run them from another bash script.

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

JetsGuy posted:

Unless I am just being a newbie, I don't see any way to run a sh script from python without using shell=True.

If you're running a sh script, then the script starts its own shell. It doesn't borrow the shell it's invoked through.

JetsGuy posted:

Well, I mean, all my python is run from the shell to begin with, so it's not like I'm launching a new shell.

There is the interactive shell that you start your python script in.
Then there is a different, noninteractive shell that python invokes because you set shell=True.
Then there is a different, noninteractive shell that the shell script runs in.

If you set shell=False, then python will avoid uselessly starting that extra shell in the middle, and you can avoid some security issues.
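
Concretely (script and file names made up):

Python code:
import subprocess

filename = "obs_001.fits"     # hypothetical, e.g. read from a list file

# shell=True: the whole string goes through /bin/sh, so anything hiding in
# filename (';', '&&', '$(...)') gets interpreted by that shell.
subprocess.check_call("./reduce.sh %s" % filename, shell=True)

# List form, shell=False (the default): reduce.sh still runs under its own
# #! interpreter, but Python passes the argument through untouched, with no
# extra shell in the middle to inject into.
subprocess.check_call(["./reduce.sh", filename])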

lunar detritus
May 6, 2009


I have been going through the Google Python Class, and while I have been getting the results that the exercises require, the provided solutions have wildly different code.

For example, one of the exercises is:

quote:

# E. Given two lists sorted in increasing order, create and return a merged
# list of all the elements in sorted order. You may modify the passed in lists.
# Ideally, the solution should work in "linear" time, making a single
# pass of both lists.

This is my code:
Python code:
def linear_merge(list1, list2):
  list1.extend(list2)
  list1 = sorted(list1)
  return list1
This is google's solution:
Python code:
def linear_merge(list1, list2):
  # +++your code here+++
  # LAB(begin solution)
  result = []
  # Look at the two lists so long as both are non-empty.
  # Take whichever element [0] is smaller.
  while len(list1) and len(list2):
    if list1[0] < list2[0]:
      result.append(list1.pop(0))
    else:
      result.append(list2.pop(0))

  # Now tack on what's left
  result.extend(list1)
  result.extend(list2)
  return result
Both result in:

quote:

linear_merge
OK got: ['aa', 'bb', 'cc', 'xx', 'zz'] expected: ['aa', 'bb', 'cc', 'xx', 'zz']
OK got: ['aa', 'bb', 'cc', 'xx', 'zz'] expected: ['aa', 'bb', 'cc', 'xx', 'zz']
OK got: ['aa', 'aa', 'aa', 'bb', 'bb'] expected: ['aa', 'aa', 'aa', 'bb', 'bb']

My question is, should I have thought up Google's solution on my own or is my version okay for what was required?

EDIT: My real question is, do I suck?

lunar detritus fucked around with this message at 18:24 on May 22, 2012

leterip
Aug 25, 2004
Your solution doesn't merge them in linear time. The sorted call runs in O(n log n) time.

German Joey
Dec 18, 2004

gmq posted:

I have been going through the Google Python Class, and while I have been getting the results that the exercises require, the provided solutions have wildly different code.

For example, one of the exercises is:


This is my code:
Python code:
def linear_merge(list1, list2):
  list1.extend(list2)
  list1 = sorted(list1)
  return list1
This is google's solution:
Python code:
def linear_merge(list1, list2):
  # +++your code here+++
  # LAB(begin solution)
  result = []
  # Look at the two lists so long as both are non-empty.
  # Take whichever element [0] is smaller.
  while len(list1) and len(list2):
    if list1[0] < list2[0]:
      result.append(list1.pop(0))
    else:
      result.append(list2.pop(0))

  # Now tack on what's left
  result.extend(list1)
  result.extend(list2)
  return result
Both result in:


My question is, should I have thought up Google's solution on my own or is my version okay for what was required?

EDIT: My real question is, do I suck?

Well, although your version produces the correct result, part of the problem stipulation was: "Ideally, the solution should work in "linear" time, making a single pass of both lists." Google's solution, as you can see, does that. It is O(N). On the other hand, your solution doesn't; it merges both lists, and then calls Python's internal sorter to figure out the rest. Your runtime is thus O(N*log(N)).

Now, that's not to say your solution is objectively worse in all circumstances. Your code is more robust (it doesn't require the two input lists to be pre-sorted in order to work, for instance), is more readable, and may actually even be faster for medium-sized lists, since sorted is a single call into optimized C code rather than a loop of interpreted python instructions. So, you shouldn't feel "bad" or whatever about your code....... BUT, you're still gonna get bent over the teacher's knee and paddled in front of all of Google University for your epic failure here...

karms
Jan 22, 2006

by Nyc_Tattoo
Yam Slacker
Talk about premature optimization. Really, if your aim is to write fast code before readable code, why write it in a language that puts the emphasis the other way round?

(Yes, the Google University code is legible, but it's not as simple as gmq's.)

German Joey
Dec 18, 2004

KARMA! posted:

Talk about premature optimization. Really, if your aim is to write fast code before readable why write it in a language that puts the emphasis on working the other way round?

(Yes, the google university code is legible but it's not as simple as gmq's.)

the purpose of the class might be to teach programming in addition to teaching python, I guess? so it could be that they're trying to think more about the structure of the data? I dunno.

etcetera08
Sep 11, 2008

I think it's really just a basic case of "you probably know what merge sort is, so implement it in Python." It's just a learning exercise so I would imagine they're more concerned about using familiar algorithms than developing a new algorithm that gets best runtime in every use case...

Emacs Headroom
Aug 2, 2003
They're probably also concerned with making sure people are aware of the cost of calling functions like "sorted". If you don't realize it's actually n*log(n), you might use it in cases where you really shouldn't.

tef
May 30, 2004

-> some l-system crap ->

gmq posted:

The solutions have wildly different code.

This happens. Make sure you understand how the given solution works, and the possible differences in speed/readability that result.

quote:

This is my code:
Python code:
def linear_merge(list1, list2):
  list1.extend(list2)
  list1 = sorted(list1)
  return list1

I like this. Except you modify list1 in-place. To be more like the Google example, you can return a new list without modifying the contents of list1.

Python code:
def linear_merge(list1, list2):
    result = []
    result.extend(list1)
    result.extend(list2)
    result.sort()
    return result
This returns a new list, sorted.

quote:

My real question is, do I suck?

Nope. Using the built-ins to achieve what you need is always a good strategy. The major problem I can see is mutating list1 in place.

Despite this, people will tell you what its performance is without profiling it. Don't listen to these people, they are bad people. (also they do not know how python's sort works)

leterip posted:

The sorted call runs in O(n log n) time.

German Joey posted:

Your runtime is thus O(N*log(N)).

You guys might enjoy finding out how sort works in python :v: http://en.wikipedia.org/wiki/Timsort

Anyway, I drew a graph, with time vs lists of range(0,size)



because the lists are pre-sorted, sort() will run in O(n) time - Timsort owns - it is an adaptive merge sort - it should scan the list *once*, split it into two pre-sorted runs, and then perform a merge. You shouldn't see O(n log n) performance here.

Note: A quick google shows that this strategy breaks down around lists of > 1000000 elements http://stackoverflow.com/a/482848
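
If anyone wants to reproduce the shape of that graph, a quick-and-dirty version (timings vary by machine; heapq.merge is thrown in as a third point of comparison, it isn't part of the exercise):

Python code:
import timeit

setup = """
import heapq
a = list(range(0, 1000000, 2))   # two already-sorted inputs
b = list(range(1, 1000000, 2))
"""

print(timeit.timeit("sorted(a + b)", setup=setup, number=10))
print(timeit.timeit("list(heapq.merge(a, b))", setup=setup, number=10))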

tef fucked around with this message at 20:22 on May 22, 2012

Emacs Headroom
Aug 2, 2003
Tef, what data did you use for sorting? If you have actual random integers, timsort should still be n log(n), I think.

tef
May 30, 2004

-> some l-system crap ->

Ridgely_Fan posted:

Tef what data did you use for sorting? If you have actual random integers, timsort should still be nlog(n) I think.

For some reason I assumed the lists were sorted. I am a bad person

Edit:

Oh I was right

quote:

# E. Given two lists sorted in increasing order

tef fucked around with this message at 20:17 on May 22, 2012


Emacs Headroom
Aug 2, 2003
D'oh. You're right. Well I learned something today.
