vodkat
Jun 30, 2012



cannot legally be sold as vodka
This is probably some really lovely inelegant coding, but I would do something like this

code:
values = []
for x in soup.find_all("td"):
    x = str(x.string)  # an empty <td> has x.string == None, so str() gives 'None'
    if x == 'None':
        x = '0.0'
    values.append(x)

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS
I'm not having a problem getting the data. Just replacing it if it's blank. The html file is 4000 lines long. It seems like it should be easy. Nothing is working.
I can't use find_all('td'), that's like 1000 things I don't need.

ButtWolf fucked around with this message at 04:35 on Jan 8, 2016

SurgicalOntologist
Jun 17, 2004

ButtWolf posted:

I'm not having a problem getting the data. Just replacing it if it's blank. The html file is 4000 lines long. It seems like it should be easy. Nothing is working.
I can't use find_all('td'), that's like 1000 things I don't need.

What about my suggestion, find('tr').find_all('td')? (See my previous post for details.)
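
A toy sketch of that idea (hypothetical html, just for illustration):

Python code:
from bs4 import BeautifulSoup

# walk one row's cells and substitute a placeholder when a td is empty
soup = BeautifulSoup('<tr><td>4.4</td><td></td><td>.544</td></tr>', 'html.parser')
fields = [td.string or '.000' for td in soup.find('tr').find_all('td')]
print(fields)  # e.g. ['4.4', '.000', '.544']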

Anyways, post the rest of your code. How are you getting strings out of these objects to begin with? Once we see that, it will be clearer how to help you.

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS
code:
import time

from bs4 import BeautifulSoup
from urllib2 import urlopen  # for Python 3: from urllib.request import urlopen

url_test = open('txt_files/testurl.txt', 'r')

f = open('txt_files/scraped_final.txt', 'w')

for line in url_test:
    # example line - the file is a lot of these lines of urls:
    # http://www.basketball-reference.com/players/a/arizatr01.html?lid=carousel_player
    html_doc = line.strip()  # drop the trailing newline before fetching
    # note: the parser name goes to BeautifulSoup, not to urlopen
    soup = BeautifulSoup(urlopen(html_doc), "html.parser")

    name_tag = soup.find("h1")
    player_name = name_tag.string
    per36_line_2015 = soup.find("tr", id="per_minute.2015")
    per36_line_2016 = soup.find("tr", id="per_minute.2016")
    adv_line_2015 = soup.find("tr", id="advanced.2015")
    adv_line_2016 = soup.find("tr", id="advanced.2016")
    per_game_line_2015 = soup.find("tr", id="per_game.2015")
    per_game_line_2016 = soup.find("tr", id="per_game.2016")
    f.write("Name: " + player_name + "\n")

    for string in per36_line_2015.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")
    for string in per36_line_2016.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")
    for string in adv_line_2015.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")
    for string in adv_line_2016.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")
    for string in per_game_line_2015.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")
    for string in per_game_line_2016.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")

    print player_name
    time.sleep(1.0)

############OUTPUT
Name: Steven Adams
2014-15 21 OKC NBA C 70 67 1771 4.4 8.1 .544 0.0 0.0 .000 4.4 8.1 .547 2.1 4.2 .502 4.0 6.6 10.6 1.3 0.8 1.7 2.0 4.5 10.9 
2015-16 22 OKC NBA C 36 36 852 3.9 6.8 .581 0.0 0.0 [b].000[/b] 3.9 6.8 .581 1.9 3.1 .622 3.9 5.7 9.6 1.0 0.5 1.9 1.4 4.4 9.8 
2014-15 21 OKC NBA C 70 1771 14.1 .549 .005 .514 12.2 19.3 15.8 5.5 1.1 3.8 16.8 14.3 1.9 2.2 4.1 .111 -1.4 1.8 0.4 1.1 
2015-16 22 OKC NBA C 36 852 14.4 .602 .000 .463 12.8 16.4 14.7 3.9 0.7 3.9 14.3 11.4 1.7 1.0 2.7 .153 0.4 1.2 1.7 0.8 
2014-15 21 OKC NBA C 70 67 25.3 3.1 5.7 .544 0.0 0.0 .000 3.1 5.7 .547 .544 1.5 2.9 .502 2.8 4.6 7.5 0.9 0.5 1.2 1.4 3.2 7.7 
2015-16 22 OKC NBA C 36 36 23.7 2.6 4.4 .581 0.0 0.0 [b].000[/b] 2.6 4.4 .581 .581 1.3 2.1 .622 2.6 3.7 6.3 0.7 0.3 1.3 0.9 2.9 6.4 
It works, I'm just trying to clean up. I ran the program until it hit an error where it was missing a value (due to the blank not being a .000). For explanation: this is for players who have not taken a 3-pointer; their % does not show up as .000, it shows nothing, so for 1 out of every 20 players (of 200) I have to do this when compiling. Not that big of a deal.
The bold .000 is what I have to add, and what I'm trying to get it to add when the td is empty.

ButtWolf fucked around with this message at 04:53 on Jan 8, 2016

SurgicalOntologist
Jun 17, 2004

Ah, I didn't know about the stripped_strings method. That's useful. Although I'm quite certain my way would have worked, you would have had to change everything after that.

Python code:
def process_float_string(string):
    encoded = string.encode('ascii', 'ignore')
    if not encoded:
        encoded = '0.0'
    return encoded + ' '
Now replace each f.write(string.encode('ascii', 'ignore') + " ") with f.write(process_float_string(string)).

Also you should look into the csv library, if not pandas, as it can take care of stuff like putting spaces between things when writing csv files.
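
For instance, a minimal sketch (Python 2 like the rest of your script; the filename and columns are made up):

Python code:
import csv

# csv puts the delimiters in and quotes awkward values for you
with open('stats.csv', 'wb') as out:  # 'wb' for the Python 2 csv module
    writer = csv.writer(out)
    writer.writerow(['Name', 'FG%', '3P%'])
    writer.writerow(['Steven Adams', '.544', '.000'])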

Or... do you not even get a string when the field is empty?

SurgicalOntologist fucked around with this message at 04:55 on Jan 8, 2016

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS

SurgicalOntologist posted:

Ah, I didn't know about the stripped_strings method. That's useful. Although I'm quite certain my way would have worked, you would have had to change everything after that.

Python code:
def process_float_string(string):
    encoded = string.encode('ascii', 'ignore')
    if not encoded:
        encoded = '0.0'
    return encoded + ' '
Now replace each f.write(string.encode('ascii', 'ignore') + " ") with f.write(process_float_string(string)).

Also you should look into the csv library, if not pandas, as it can take care of stuff like putting spaces between things when writing csv files.

Or... do you not even get a string when the field is empty?

The output up there is exactly what the text file is: per36_line_2015[4] is C, [10] is .544. I'm doing a lot of calculations based on all of this stuff in another file. I need the file exactly like it is, with .000 added where it doesn't read anything in the td in the corresponding tr.
So there is the problem: see how lines 1 and 2 have values that are similar up until those 0s in the middle, then a .XXX on line 1 but not on line 2. That's position [13], so if I wanted to grab it, it should be .000, but if I run for 2016 it reads 3.9, which is offensive rebounds I think, and if I try to pull value [28], the last one (points), it won't read - out of range.

I thought BeautifulSoup would have something cool built in but I can't find it. I'm going to bed. Thanks for your help anyway, guys.

ButtWolf fucked around with this message at 05:12 on Jan 8, 2016

SurgicalOntologist
Jun 17, 2004

The idea is that if not encoded tests for an empty string, and replaces it with your placeholder value. Or is the problem that you're not getting a string at all, even an empty one? If that's the case you'll need to go back to my previous suggestion instead of the stripped_strings method. If stripped_strings simply skips empty fields (rather than giving you an empty string) then you shouldn't use it.

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS
So you think it's a string output problem, not when I'm reading it? That's where we were confused. No that doesn't seem to work. None of it works. I'm burning everything down. kbye

ButtWolf fucked around with this message at 05:15 on Jan 8, 2016

SurgicalOntologist
Jun 17, 2004

ButtWolf posted:

I need the file exactly like it is, with .000 added where it doesn't read anything in the td in the corresponding tr.

Do what you want, but you're giving yourself lots of headaches by tooling your own "write tabular data to a file" function. For example if you used commas then you'd have an easier time finding empty fields.

ButtWolf posted:

I thought BeautifulSoup would have something cool built in but I can't find it.

I don't think BeautifulSoup has anything specifically for data handling (which is really what you're asking for). You want pandas, as I suggested before.

Yes, you'll need to retool your following steps as well, but it would be relatively simple using a library designed for data.

For example:

Python code:
import sys
import pandas as pd


def fix_table(table):
    for column in table.columns:
        if column.startswith('Unnamed:'):
            del table[column]
    return table


url = 'http://www.basketball-reference.com/players/a/arizatr01.html?lid=carousel_player'

tables = pd.read_html(url)

per_36 = fix_table(tables[2])
per_36.to_csv(sys.stdout, index=False)

advanced = fix_table(tables[4])
advanced.to_csv(sys.stdout, index=False)
gives me

code:
Season,Age,Tm,Lg,Pos,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
2004-05,19.0,NYK,NBA,SF,80,12,1382,4.5,10.1,0.442,0.1,0.3,0.231,4.4,9.8,0.449,3.2,4.5,0.695,2.3,4.0,6.3,2.2,1.8,0.5,1.9,3.9,12.2
2005-06,20.0,TOT,NBA,SF,57,10,999,3.4,8.1,0.41200000000000003,0.0,0.2,0.2,3.3,8.0,0.41600000000000004,2.8,4.6,0.606,2.8,5.0,7.9,2.2,2.1,0.4,2.2,3.6,9.5
2005-06,20.0,NYK,NBA,SF,36,10,709,3.1,7.4,0.418,0.1,0.2,0.33299999999999996,3.0,7.3,0.42,2.1,3.9,0.545,2.5,4.5,7.0,2.3,2.2,0.5,2.3,4.0,8.4
2005-06,20.0,ORL,NBA,SF,21,0,290,4.0,9.9,0.4,0.0,0.2,0.0,4.0,9.7,0.41,4.3,6.2,0.7,3.6,6.5,10.1,1.9,1.7,0.2,2.0,2.7,12.3
2006-07,21.0,ORL,NBA,SG,57,7,1278,5.6,10.5,0.539,0.0,0.2,0.0,5.6,10.3,0.5489999999999999,3.0,4.8,0.62,2.8,4.3,7.0,1.8,1.7,0.5,2.4,3.8,14.3
2007-08,22.0,TOT,NBA,SF,35,3,546,4.5,9.0,0.507,0.3,1.2,0.278,4.2,7.8,0.542,3.2,4.9,0.653,2.0,5.1,7.1,2.9,2.0,0.7,1.5,3.0,12.7
2007-08,22.0,ORL,NBA,SF,11,0,115,4.4,9.7,0.452,0.0,0.9,0.0,4.4,8.8,0.5,2.5,4.7,0.5329999999999999,1.6,5.9,7.5,2.5,1.6,0.9,1.6,2.8,11.3
2007-08,22.0,LAL,NBA,SF,24,3,431,4.6,8.8,0.524,0.4,1.3,0.33299999999999996,4.2,7.5,0.556,3.4,5.0,0.6829999999999999,2.1,4.9,7.0,3.0,2.2,0.7,1.5,3.0,13.0
2008-09,23.0,LAL,NBA,SF,82,20,1999,4.9,10.7,0.46,1.1,3.4,0.319,3.8,7.3,0.526,2.2,3.0,0.71,2.0,4.3,6.3,2.6,2.5,0.4,1.6,3.0,13.1
2009-10,24.0,HOU,NBA,SF,72,71,2629,5.4,13.7,0.39399999999999996,1.9,5.6,0.33399999999999996,3.5,8.1,0.436,2.0,3.1,0.649,1.1,4.5,5.5,3.8,1.7,0.5,2.2,2.2,14.7
2010-11,25.0,NOH,NBA,SF,75,75,2600,4.2,10.6,0.39799999999999996,1.1,3.8,0.303,3.1,6.8,0.45,1.9,2.7,0.701,0.8,4.8,5.6,2.2,1.7,0.4,1.6,2.5,11.4
2011-12,26.0,NOH,NBA,SF,41,41,1350,4.5,10.7,0.41700000000000004,0.8,2.3,0.33299999999999996,3.7,8.4,0.44,2.1,2.7,0.775,1.1,4.6,5.7,3.6,1.8,0.7,2.0,1.9,11.8
2012-13,27.0,WAS,NBA,SF,56,15,1471,4.6,11.0,0.41700000000000004,1.9,5.1,0.364,2.7,5.9,0.46299999999999997,1.9,2.3,0.821,1.1,5.4,6.5,2.8,1.8,0.5,2.1,1.8,13.0
2013-14,28.0,WAS,NBA,SF,77,77,2723,5.1,11.3,0.456,2.4,5.8,0.40700000000000003,2.8,5.4,0.509,2.0,2.6,0.772,1.3,5.0,6.3,2.5,1.7,0.3,1.7,2.4,14.6
2014-15,29.0,HOU,NBA,SF,82,82,2930,4.5,11.2,0.402,2.4,6.8,0.35,2.1,4.4,0.485,1.5,1.8,0.853,0.9,4.7,5.6,2.6,1.9,0.2,1.7,2.3,12.9
2015-16,30.0,HOU,NBA,SF,35,35,1192,4.2,10.8,0.39,2.1,6.3,0.33799999999999997,2.1,4.6,0.461,1.4,1.8,0.754,0.7,4.5,5.2,2.1,1.9,0.2,1.7,2.4,12.0
Career,,,NBA,,749,448,21099,4.7,11.0,0.42700000000000005,1.4,4.1,0.34700000000000003,3.3,6.9,0.475,2.1,3.0,0.711,1.4,4.7,6.1,2.7,1.8,0.4,1.9,2.6,12.9
Season,Age,Tm,Lg,Pos,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
2004-05,19.0,NYK,NBA,SF,80,1382,13.3,0.503,0.033,0.447,7.5,13.1,10.3,9.8,2.7,1.0,13.6,17.8,0.8,1.3,2.1,0.073,-1.6,0.7,-1.0,0.4
2005-06,20.0,TOT,NBA,SF,57,999,11.8,0.46799999999999997,0.022000000000000002,0.562,9.6,17.1,13.3,9.8,3.0,0.9,18.0,15.8,-0.1,1.0,0.9,0.043,-2.1,1.5,-0.6,0.4
2005-06,20.0,NYK,NBA,SF,36,709,10.2,0.45899999999999996,0.021,0.527,8.3,15.3,11.7,10.4,3.2,1.0,20.4,14.4,-0.4,0.6,0.3,0.02,-2.3,2.0,-0.3,0.3
2005-06,20.0,ORL,NBA,SF,21,290,15.8,0.485,0.025,0.625,12.7,21.5,17.3,8.3,2.6,0.5,13.6,19.3,0.2,0.4,0.6,0.10099999999999999,-1.6,0.2,-1.4,0.0
2006-07,21.0,ORL,NBA,SG,57,1278,16.2,0.5670000000000001,0.019,0.461,9.7,14.2,12.0,8.9,2.5,1.2,15.8,19.2,1.5,1.9,3.4,0.128,0.2,1.2,1.5,1.1
2007-08,22.0,TOT,NBA,SF,35,546,16.1,0.568,0.132,0.551,6.4,15.4,11.1,11.8,2.9,1.5,12.0,15.6,0.9,0.9,1.8,0.157,0.0,1.8,1.8,0.5
2007-08,22.0,ORL,NBA,SF,11,115,12.7,0.479,0.09699999999999999,0.484,5.2,18.3,12.0,10.7,2.2,1.9,11.7,17.0,0.0,0.2,0.2,0.085,-3.4,1.9,-1.5,0.0
2007-08,22.0,LAL,NBA,SF,24,431,17.1,0.594,0.14300000000000002,0.5710000000000001,6.7,14.6,10.8,12.0,3.0,1.3,12.0,15.3,0.9,0.7,1.6,0.17600000000000002,0.9,1.8,2.7,0.5
2008-09,23.0,LAL,NBA,SF,82,1999,15.5,0.544,0.32,0.284,6.4,13.4,10.0,10.4,3.5,0.9,11.5,16.6,2.5,3.5,6.1,0.146,1.0,2.1,3.1,2.6
2009-10,24.0,HOU,NBA,SF,72,2629,13.3,0.488,0.40700000000000003,0.228,3.3,14.7,8.8,16.7,2.4,1.1,12.8,21.2,0.1,3.1,3.2,0.057999999999999996,0.0,1.3,1.3,2.2
2010-11,25.0,NOH,NBA,SF,75,2600,11.3,0.48700000000000004,0.35600000000000004,0.259,2.8,16.4,9.5,10.0,2.5,1.0,12.2,17.7,-0.4,3.8,3.4,0.062,-1.2,2.3,1.1,2.0
2011-12,26.0,NOH,NBA,SF,41,1350,14.2,0.496,0.21600000000000003,0.253,3.7,15.1,9.5,16.8,2.8,1.5,14.5,18.4,0.5,1.7,2.2,0.078,-0.3,2.2,1.8,1.3
2012-13,27.0,WAS,NBA,SF,56,1471,14.0,0.5379999999999999,0.46299999999999997,0.21100000000000002,3.3,16.9,10.0,12.7,2.5,1.1,14.6,17.9,0.7,2.4,3.1,0.102,-0.4,1.9,1.5,1.3
2013-14,28.0,WAS,NBA,SF,77,2723,15.8,0.59,0.518,0.226,4.1,16.3,10.1,10.8,2.4,0.6,12.3,17.8,4.3,3.7,8.0,0.141,2.3,1.1,3.4,3.6
2014-15,29.0,HOU,NBA,SF,82,2930,12.7,0.539,0.61,0.157,2.9,14.4,8.6,11.1,2.6,0.5,12.7,16.5,2.7,3.9,6.6,0.10800000000000001,0.6,1.3,1.9,2.9
2015-16,30.0,HOU,NBA,SF,35,1192,11.0,0.513,0.5770000000000001,0.17,2.1,13.8,7.9,9.2,2.7,0.5,12.5,16.1,0.4,0.8,1.2,0.048,-0.4,-0.1,-0.4,0.5
Career,,,NBA,,749,21099,13.7,0.525,0.374,0.26899999999999996,4.6,15.1,9.8,11.7,2.7,0.9,13.2,17.8,13.8,28.1,41.9,0.095,0.1,1.5,1.5,18.7
Now pandas also has facilities for handling missing data in your computations, so the blanks won't be an issue in your next steps. "Worst case scenario" you'd add something like .fillna(0.0).
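
For example, with a toy frame standing in for one of the tables above:

Python code:
import pandas as pd

# blank cells come through as NaN; fillna swaps in your placeholder
table = pd.DataFrame({'3P%': [0.231, None, 0.350]})
table = table.fillna(0.0)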

SurgicalOntologist
Jun 17, 2004

ButtWolf posted:

No that doesn't seem to work. None of it works. I'm burning everything down. kbye

Umm can you be more specific? What did you try, and what happened? You're impossible to help.

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS
I tried every single thing you suggested and nothing did what I needed. How am I impossible? I'm not getting mad or anything I'm lightheartedly joking and wanting to go to sleep.

SurgicalOntologist
Jun 17, 2004

Because you're not saying exactly what you did or what happened.

What you did: "what you suggested" isn't specific enough. Only my most recent post gave an entire example. For example, maybe you did something like assuming the line I gave was a drop-in replacement for an existing line in your code. And my real suggestion, "use the csv module", is not specific enough to know what you actually did.

What happened: an error? The same output as before? Something else? In either case, due diligence on your end would be to step through the code and see what values the variable in question takes at the critical moment.

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS

SurgicalOntologist posted:

Because you're not saying exactly what you did or what happened.

What you did: "what you suggested" isn't specific enough. Only my most recent post gave an entire example. For example, maybe you did something like assuming the line I gave was a drop-in replacement for an existing line in your code. And my real suggestion, "use the csv module", is not specific enough to know what you actually did.

What happened: an error? The same output as before? Something else? In either case, due diligence on your end would be to step through the code and see what values the variable in question takes at the critical moment.

Whatever. It now just feels like you are mad at me for not knowing what I'm doing. I'm sorry if I wasn't explaining myself, but I'm new and not good and I only try this crap about once a year, usually because of this. It just seems like all the time it's "I'm one little thing away from it being done and just the way I want, can I get help?" --- "Rewrite the entire thing, using this other library that you don't know about or some really advanced way of doing it." It's about 6 little files, each 40-100 lines. And this is basically all I need: one <td> field that is empty 10% of the time.

It's frustrating and I know I may have problems explaining myself, so I'm sorry. I'm just gonna go to sleep.

SurgicalOntologist
Jun 17, 2004

Fair enough, sorry I was frustrated too.

But to be fair, using some "really advanced" library or module often actually requires less effort, since the hard work has been done by someone else, as I hoped would be clear from the code I posted.

And just because it seems like "one little thing" is wrong doesn't mean that there's one corresponding little fix. In this case, my guess is that stripped_strings was skipping empty fields rather than giving you an empty string. That requires doing it another way than stripped_strings, so already there's going to be some work involved.

I suggested another way, but that involved using more BeautifulSoup finds. Since these can have issues depending on the page (i.e. you often need to add some specifiers to get just the elements you want), there is a trial-and-error process involved. Doing this trial-and-error in the forums is pretty much impossible, so here you go, I did it for you. It turned out that getting all the trs gets you too many, so I needed to specify the class. Otherwise this is pretty much what I was suggesting before.

Python code:
from bs4 import BeautifulSoup
import requests


def process_table(table_element):
    lines = []
    for tr in table_element.find_all('tr', attrs={'class': 'full_table'}):
        lines.append([td.string or '.000' for td in tr.find_all('td')])
    return lines


def write_table(file, table_list):
    file.writelines(' '.join(line) + '\n' for line in table_list)


url = 'http://www.basketball-reference.com/players/a/arizatr01.html?lid=carousel_player'
output_path = 'output.txt'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

with open(output_path, 'w') as f:
    write_table(f, process_table(soup.find('table', id='per_minute')))
    write_table(f, process_table(soup.find('table', id='advanced')))

I didn't feel like troubleshooting all that encoding BS so I just used Python 3. Sorry if you can't use it too, but if not you can probably stick the logic you were using into write_table.

SurgicalOntologist fucked around with this message at 05:49 on Jan 8, 2016

SurgicalOntologist
Jun 17, 2004

And to atone for being snappy (and to procrastinate on my own work) I attempted to improve the readability a bit, so you could understand what I did. This should also make it easier to add back in your logic for handling multiple urls.

Python code:
from bs4 import BeautifulSoup
import requests


def tables_from_url(url):
    soup = BeautifulSoup(requests.get(url).text, 'lxml')

    name = soup.find('h1').string
    print('reading tables for {}...'.format(name), end='')

    table_ids = ['per_minute', 'advanced']
    tables = [read_table(soup.find('table', id=table_id)) for table_id in table_ids]

    print('done.')
    return tables


def read_table(table_element):
    return [read_line(line) for line in table_element.find_all('tr', attrs={'class': 'full_table'})]


def read_line(line_element):
    return [read_field(field) for field in line_element.find_all('td')]


def read_field(field_element):
    return field_element.string or '.000'


def write_table(file, table_list):
    file.writelines(' '.join(line) + '\n' for line in table_list)


url = 'http://www.basketball-reference.com/players/a/arizatr01.html?lid=carousel_player'
output_path = 'output.txt'

with open(output_path, 'w') as f:
    for table in tables_from_url(url):
        write_table(f, table)

SurgicalOntologist fucked around with this message at 06:12 on Jan 8, 2016

floppo
Aug 24, 2005
Another BeautifulSoup type question:

I am scraping links from an html page that has a 'pager'. There is some javascript at the end of the html file that seems to govern what the pager does, but basically it goes through the pages of results without updating the web address in the browser, and I can't figure out how to automatically get the updated html that results from going to the next page. Here is the pager html when I'm on page 8 of the results, for example:

<div class="pager">
<span class="current">
8 / 370 page, 7384 results found total
</span>
</div>

</div>
<div class="pager form">
<a href="#" onclick="setPage('7');return false;" class="prev ">R<span>‹</span></a>
<a href="#" onclick="setPage('9');return false;" class="next ">R<span>›</span></a>
</div>

</div>
</div>

Can BS handle this somehow?

Kuule hain nussivan
Nov 27, 2008

floppo posted:

Another BeautifulSoup type question:

I am scraping links from an html page that has a 'pager'. There is some javascript at the end of the html file that seems to govern what the pager does, but basically it goes through the pages of results without updating the web address in the browser, and I can't figure out how to automatically get the updated html that results from going to the next page. Here is the pager html when I'm on page 8 of the results, for example:

<div class="pager">
<span class="current">
8 / 370 page, 7384 results found total
</span>
</div>

</div>
<div class="pager form">
<a href="#" onclick="setPage('7');return false;" class="prev ">R<span>‹</span></a>
<a href="#" onclick="setPage('9');return false;" class="next ">R<span>›</span></a>
</div>

</div>
</div>

Can BS handle this somehow?
I think the common way of doing this is using Selenium to interact with the page and BS to parse the result.
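
A minimal sketch of that split (Selenium-2-era API; the URL and the pager selectors are assumptions based on the html above):

Python code:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://example.com/results')  # hypothetical results page

links = []
while True:
    # BS parses whatever the browser is currently showing
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    links.extend(a['href'] for a in soup.find_all('a', href=True))

    # Selenium drives the javascript pager
    next_links = driver.find_elements_by_css_selector('div.pager.form a.next')
    if not next_links:
        break  # no next link means this is the last page
    next_links[0].click()  # fires setPage(...)
    time.sleep(1)  # crude wait for the new results to render

driver.quit()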

floppo
Aug 24, 2005
thanks - Selenium is working very well

pmchem
Jan 22, 2010


There's a general problem with python when using it for jobs with large core counts on HPC machines: the file system performance hit on importing modules is quite significant. One approach to solving this problem is here:
https://github.com/rainwoodman/python-mpi-bcast

There are other approaches. Has anyone else in this thread had to deal with this problem? What is your preferred (and preferably low effort) solution?
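
The core trick behind that repo is "touch the filesystem once, then broadcast". A stripped-down illustration of the pattern with mpi4py (the bundle filename is hypothetical; python-mpi-bcast itself also handles the bundling and sys.path setup):

Python code:
from mpi4py import MPI

comm = MPI.COMM_WORLD

# only rank 0 hits the shared filesystem; every other rank receives the
# bytes over the interconnect instead of hammering the metadata server
payload = None
if comm.Get_rank() == 0:
    with open('packages.tar.gz', 'rb') as f:
        payload = f.read()
payload = comm.bcast(payload, root=0)
# each rank would then unpack payload to node-local storage (e.g. /tmp)
# and prepend that directory to sys.path before the heavy imports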

Highblood
May 20, 2012

Let's talk about tactics.
Stupid question incoming: big newbie to programming here. What's the point of dictionaries? I'm used to the more C-like struct, and they seem similar, but I don't get the differences. Why assign values to strings? Why not just a list of variables (man, reaching for that " key is really hard okay)? From what I understand you can add and remove entries from the dictionary whenever you want, but the applications for that fly completely over my head (I'm dumb, sorry).

I just feel like whatever I could do with that could just as easily be done without it (C has no associative arrays, right?)

Hopefully nobody got an aneurysm from that question.

Asymmetrikon
Oct 30, 2009

I believe you're a big dork!
Being able to have a structure hold a runtime-variable number of entries is really important to a lot of use cases - what if the user needs to be able to add more things to your "list of variables" on the fly, like if you have some kind of list program? What if you don't necessarily know the exact form of the struct you're going to receive because of possible errors or unforeseen problems, like doing interaction over the web via JSON? Dictionaries are trivially searchable by their key, which is another neat feature that is more inconvenient without them.

A feature not being in C isn't particularly telling, either; C doesn't have a lot of things built-in most modern languages have, like lists, try/catch, generic types, lambdas, first-class functions, etc. We still like to have those things.
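
The JSON case is a nice concrete one: you get back a dict whose keys you only discover at runtime. A tiny made-up example:

Python code:
import json

# a web response whose exact fields you can't know when writing the code
payload = json.loads('{"name": "goon", "score": 9.95, "extra": null}')
for key, value in payload.items():
    print(key, value)  # iterate over whatever actually arrived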

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

Highblood posted:

Stupid question incoming: Big newbie to programming here. What's the point of dictionaries? I'm used to the more C-like struct [...]

Well those two (dictionaries and C structs) are fairly different things. The members of a struct are hard coded whereas you can put what you want into a dictionary at runtime. Your question is a bit like asking what's the use of Lists, or of vector from C++; what's the point when I can just declare an array big enough to hold what I want? The answer to that is another question: what if you don't know how big you want it to be?

Going back to the dictionary example, maybe you not only don't know how many items you need but also what "identity" (key) those items have. The keys will generally have a meaningful identity which you may or may not know when you write the code. Consider for example that you might have a dictionary that maps names of countries to their populations. You don't know up front what countries might be represented, and it wouldn't be practical to just list all possible countries, not least because your list might go out of date.

By the way, in Python more than in many other languages, the line between a class (the nearest thing Python has to a C struct) and a dictionary is a bit blurred, but that's something that you only learn about if you get intimately familiar with the language.
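
The country example in a few lines of hypothetical code:

Python code:
# keys and values arrive at runtime, e.g. parsed from a census file
populations = {}
for name, count in [('Monaco', 38695), ('Fiji', 905502)]:  # made-up data
    populations[name] = count
print(populations['Monaco'])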

Cingulate
Oct 23, 2012

by Fluffdaddy
Dicts were the feature that convinced me to learn Python. I've told my Python class that my personal 1st rule of Python is: "when in doubt, it's probably best to use a list comprehension" and my personal 2nd rule of Python is: "when still in doubt, it's probably best to use a dict".

Related question: is that bad?

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

Cingulate posted:

Dicts were the feature that convinced me to learn Python. I've told my Python class that my personal 1st rule of Python is: "when in doubt, it's probably best to use a list comprehension" and my personal 2nd rule of Python is: "when still in doubt, it's probably best to use a dict".

Related question: is that bad?

When do you break out the dictionary comprehensions?

SurgicalOntologist
Jun 17, 2004

If you're teaching scientific Python I could see that as being counterproductive advice (specifically the list comprehension adage). I would hate to see sums = [a + b for a, b in zip(array1, array2)] in scientific code.

And yeah I was going to say the same re: dict comprehensions. Those aphorisms are vague, but then again they're just aphorisms and presumably they complement some instruction so whatever. But I would say something like "when in doubt, use core data structures (usually a list, tuple, or dictionary)" and "use comprehensions to construct lists, dicts, and sets when possible".
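
For the record, a dict (or set) comprehension is the same idea with braces (toy example):

Python code:
words = ['per_minute', 'advanced', 'per_game']
lengths = {w: len(w) for w in words}  # dict comprehension
initials = {w[0] for w in words}      # set comprehension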

Symbolic Butt
Mar 22, 2009

(_!_)
Buglord

SurgicalOntologist posted:

I would hate to see sums = [a + b for a, b in zip(array1, array2)] in scientific code.

lol that's me, I write this kind of code all the time... :eng99:

SurgicalOntologist
Jun 17, 2004

Break the habit! sums = array1 + array2 is easier in every way.
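
(That's elementwise addition on numpy arrays, as the next few posts sort out. A minimal example:)

Python code:
import numpy as np

array1 = np.array([5.0, 10.0, 15.0])
array2 = np.array([2.0, 4.0, 6.0])
sums = array1 + array2  # array([ 7., 14., 21.]), elementwise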

Jewel
May 2, 2009

:confused: But python doesn't have that syntax natively.

Python2:

>>> arr1 = [5, 10, 15]
>>> arr2 = [2, 4, 6]
>>> print arr1 + arr2
[5, 10, 15, 2, 4, 6]

Python 3 gives the same result.

Are you sure you're not just expecting numpy or scipy or something similar to be installed by default?

SurgicalOntologist
Jun 17, 2004

Right, that's what I was implying with "if you're teaching scientific Python" as well as by using "array" in the variable names, sorry for not being clear.

If you're not using numpy arrays, then whatever. I was talking about using numpy/scipy for certain things but still using list comprehensions for basic operations, which is something I see quite a lot.

Cingulate
Oct 23, 2012

by Fluffdaddy

Hammerite posted:

When do you break out the dictionary comprehensions?
As SO implied - in truth, I'm of course actually saying "1st rule: when in doubt, do a list comprehension parentheses open or a dict comprehension which I'll explain later parentheses close".

SurgicalOntologist posted:

If you're teaching scientific Python I could see that as being counterproductive advice (specifically the list comprehension adage). I would hate to see `sums = [a + b for a, b in zip(array1, array2)] in scientific code.

And yeah I was going to say the same re: dict comprehensions. Those aphorisms are vague, but then again they're just aphorisms and presumably they complement some instruction so whatever. But I would say something like "when in doubt, use core data structures (usually a list, tuple, or dictionary)" and "use comprehensions to construct lists, dicts, and sets when possible".

Symbolic Butt posted:

lol that's me, I write this kind of code all the time... :eng99:

SurgicalOntologist posted:

Break the habit! sums = array1 + array2 is easier in every way.
c = a + b is indeed nicer, but I like the universality of list comps. See:

df["field_1"] = [do_something_with(x, y) for x, y in zip(iterable1, iterable2)]
df["field_2"] = [x + y for x, y in zip(iterable1, iterable2)]

Of course,

df["field_2"] = iterable1 + iterable2

is better. But the point is, in a non-toy case where you don't know the equivalent of iterable1 + iterable2 (or whether there even is one), you can always fall back on the comp, and then later, when you make the code nice, replace your 145 comps with good code.

Cingulate fucked around with this message at 16:45 on Jan 11, 2016

Cingulate
Oct 23, 2012

by Fluffdaddy

Jewel posted:

:confused: But python doesn't have that syntax natively.

Python2:

>>> arr1 = [5, 10, 15]
>>> arr2 = [2, 4, 6]
>>> print arr1 + arr2
[5, 10, 15, 2, 4, 6]

Python 3 gives the same result.

Are you sure you're not just expecting numpy or scipy or something similar to be installed by default?
As SO said, it's scientific python.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Jewel posted:

:confused: But python doesn't have that syntax natively.

Python2:

>>> arr1 = [5, 10, 15]
>>> arr2 = [2, 4, 6]
>>> print arr1 + arr2
[5, 10, 15, 2, 4, 6]

Python 3 gives the same result.

Are you sure you're not just expecting numpy or scipy or something similar to be installed by default?

those are lists, not arrays, jewel

SurgicalOntologist
Jun 17, 2004

Your field2 example just hides the issue inside a function, but it's the same thing.

But anyway, I don't disagree. I basically have two mindsets when coding in Python. Either I'm working with arrays, in which case I vectorize everything and include foo = np.asarray(foo) for public API functions. Or, I'm working with generators and iterables everywhere. For me I'm usually clearly in one camp or the other but if there was a more murky situation, yeah, I would stick to iterables.
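
That asarray pattern in miniature (a hypothetical public function):

Python code:
import numpy as np

def normalize(values):
    values = np.asarray(values, dtype=float)  # accept lists or arrays at the boundary
    return values / values.sum()

print(normalize([1, 2, 5]))  # works on a plain list too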

Cingulate
Oct 23, 2012

by Fluffdaddy

SurgicalOntologist posted:

Your field2 example just hides the issue inside a function, but it's the same thing.

But anyway, I don't disagree. I basically have two mindsets when coding in Python. Either I'm working with arrays, in which case I vectorize everything and include foo = np.asarray(foo) for public API functions. Or, I'm working with generators and iterables everywhere. For me I'm usually clearly in one camp or the other but if there was a more murky situation, yeah, I would stick to iterables.
The point is, comps are fairly universal; one form across almost all usage cases. That helps if I'm teaching beginners.

This is in the realistic situation where you're primarily a scientist and don't really care about programming, and your code starts as a prototype and never really stops being a prototype.

SurgicalOntologist
Jun 17, 2004

Cingulate posted:

The point is, comps are fairly universal; one form across almost all usage cases. That helps if I'm teaching beginners.
Yeah, fair enough, and I agree with this as far as what to teach first. My issue is with making "if in doubt, use list comprehensions" your #1 maxim assuming you'll later teach "but in certain situations [when you have a numpy array] you may prefer a vectorized operation."

Unless you're not going to be using numpy at all, then whatever.

Cingulate posted:

This is in the realistic situation where you're primarily a scientist and don't really care about programming, and your code starts as a prototype and never really stops being a prototype.
For me this makes me more likely to code it the vectorized way first, because it's easier to code and if it's a prototype I'm not worrying about "what if they pass an iterable." This seems logical to me: with longer-term maintained code for wider use, use the more general form. When you're just whipping something together, use the specific form if it's easier to write.

Cingulate
Oct 23, 2012

by Fluffdaddy

SurgicalOntologist posted:

Yeah, fair enough, and I agree with this as far as what to teach first. My issue is with making "if in doubt, use list comprehensions" your #1 maxim assuming you'll later teach "but in certain situations [when you have a numpy array] you may prefer a vectorized operation."
The maxim is "when in doubt, ..." after all; that is, if you can't think of how to solve a problem, you probably can find a way if you think about using comprehensions.

SurgicalOntologist posted:

Unless you're not going to be using numpy at all, then whatever.
Well, mostly pandas, but you can also do the equivalent thing in pandas of course (df["cond1"] + df["cond2"]). And that's certainly something I'm trying to convey.

SurgicalOntologist posted:

For me this makes me more likely to code it the vectorized way first, because it's easier to code and if it's a prototype I'm not worrying about "what if they pass an iterable." This seems logical to me: with longer-term maintained code for wider use, use the more general form. When you're just whipping something together, use the specific form if it's easier to write.
The thing is: in a non-toy case, where it won't be as simple as having a very transparent and intuitive shorthand (such as c = a + b) at hand, I don't want them to be frustrated, but to think about the problem in a structured way. Like: you know you can do this. There probably is a way of doing it with comprehensions or dicts. Just think.

And if it's too slow/long, there probably is a better way.

SurgicalOntologist
Jun 17, 2004

Okay, fair enough, sounds reasonable. Sorry for splitting hairs.

Cingulate
Oct 23, 2012

by Fluffdaddy

SurgicalOntologist posted:

Okay, fair enough, sounds reasonable. Sorry for splitting hairs.
No no, I asked the question. I guess I've "successfully defended" myself here, but I'll definitely think about this the next time I teach beginners (= people even worse at Python than I am).

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

Suspicious Dish posted:

those are lists, not arrays, jewel

In the context of general Python programming, I would interpret "array" as synonymous with "list", since lists are the closest match for the notion of array in the Python standard library. In the context of using Python with Numpy, there might be an argument that this no longer holds; but as discussed, Numpy was at best implicitly introduced to the discussion context.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Hammerite posted:

In the context of general Python programming, I would interpret "array" as synonymous with "list", since lists are the closest match for the notion of array in the Python standard library.

you forgot about https://docs.python.org/2/library/array.html

as you should
