vodkat
Jun 30, 2012



cannot legally be sold as vodka
This is probably some really lovely inelegant coding, but I would do something like this

code:
values = []
for x in soup.find_all("td"):
    x = str(x.string)  # an empty <td> has x.string == None, so str() gives 'None'
    if x == 'None':
        x = '0.0'
    values.append(x)

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS
I'm not having a problem getting the data. Just replacing it if it's blank. The html file is 4000 lines long. It seems like it should be easy. Nothing is working.
I can't use find_all('td'), that's like 1000 things I don't need.

ButtWolf fucked around with this message at 04:35 on Jan 8, 2016

SurgicalOntologist
Jun 17, 2004

ButtWolf posted:

I'm not having a problem getting the data. Just replacing it if it's blank. The html file is 4000 lines long. It seems like it should be easy. Nothing is working.
I can't use find_all('td'), that's like 1000 things I don't need.

What about my suggestion, find('tr').find_all('td')? (See my previous post for details.)
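
A toy sketch of that idea (hypothetical html, just for illustration):

Python code:
from bs4 import BeautifulSoup

# walk one row's cells and substitute a placeholder when a td is empty
soup = BeautifulSoup('<tr><td>4.4</td><td></td><td>.544</td></tr>', 'html.parser')
fields = [td.string or '.000' for td in soup.find('tr').find_all('td')]
print(fields)  # e.g. ['4.4', '.000', '.544']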

Anyways, post the rest of your code. How are you getting strings out of these objects to begin with? Once we see that, it will be clearer how to help you.

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS
code:
import time

from bs4 import BeautifulSoup
from urllib2 import urlopen  # for Python 3: from urllib.request import urlopen

url_test = open('txt_files/testurl.txt', 'r')

f = open('txt_files/scraped_final.txt', 'w')

for line in url_test:
    # example line - the file is a lot of these lines of urls:
    # http://www.basketball-reference.com/players/a/arizatr01.html?lid=carousel_player
    html_doc = line.strip()  # drop the trailing newline before fetching
    # note: the parser name goes to BeautifulSoup, not to urlopen
    soup = BeautifulSoup(urlopen(html_doc), "html.parser")

    name_tag = soup.find("h1")
    player_name = name_tag.string
    per36_line_2015 = soup.find("tr", id="per_minute.2015")
    per36_line_2016 = soup.find("tr", id="per_minute.2016")
    adv_line_2015 = soup.find("tr", id="advanced.2015")
    adv_line_2016 = soup.find("tr", id="advanced.2016")
    per_game_line_2015 = soup.find("tr", id="per_game.2015")
    per_game_line_2016 = soup.find("tr", id="per_game.2016")
    f.write("Name: " + player_name + "\n")

    for string in per36_line_2015.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")
    for string in per36_line_2016.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")
    for string in adv_line_2015.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")
    for string in adv_line_2016.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")
    for string in per_game_line_2015.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")
    for string in per_game_line_2016.stripped_strings:
        f.write(string.encode('ascii', 'ignore') + " ")
    f.write("\n")

    print player_name
    time.sleep(1.0)

############OUTPUT
Name: Steven Adams
2014-15 21 OKC NBA C 70 67 1771 4.4 8.1 .544 0.0 0.0 .000 4.4 8.1 .547 2.1 4.2 .502 4.0 6.6 10.6 1.3 0.8 1.7 2.0 4.5 10.9 
2015-16 22 OKC NBA C 36 36 852 3.9 6.8 .581 0.0 0.0 [b].000[/b] 3.9 6.8 .581 1.9 3.1 .622 3.9 5.7 9.6 1.0 0.5 1.9 1.4 4.4 9.8 
2014-15 21 OKC NBA C 70 1771 14.1 .549 .005 .514 12.2 19.3 15.8 5.5 1.1 3.8 16.8 14.3 1.9 2.2 4.1 .111 -1.4 1.8 0.4 1.1 
2015-16 22 OKC NBA C 36 852 14.4 .602 .000 .463 12.8 16.4 14.7 3.9 0.7 3.9 14.3 11.4 1.7 1.0 2.7 .153 0.4 1.2 1.7 0.8 
2014-15 21 OKC NBA C 70 67 25.3 3.1 5.7 .544 0.0 0.0 .000 3.1 5.7 .547 .544 1.5 2.9 .502 2.8 4.6 7.5 0.9 0.5 1.2 1.4 3.2 7.7 
2015-16 22 OKC NBA C 36 36 23.7 2.6 4.4 .581 0.0 0.0 [b].000[/b] 2.6 4.4 .581 .581 1.3 2.1 .622 2.6 3.7 6.3 0.7 0.3 1.3 0.9 2.9 6.4 
It works, I'm just trying to clean up. I ran the program until it hit an error where it was missing a value (due to the blank not being a .000). For explanation: this is for players who have not taken a 3-pointer; their % does not show up as .000, it shows nothing, so for 1 out of every 20 players (of 200) I have to do this when compiling. Not that big of a deal.
The bold .000 is what I have to add, and what I'm trying to get it to add when the td is empty.

ButtWolf fucked around with this message at 04:53 on Jan 8, 2016

SurgicalOntologist
Jun 17, 2004

Ah, I didn't know about the stripped_strings method. That's useful. Although I'm quite certain my way would have worked, you would have had to change everything after that.

Python code:
def process_float_string(string):
    encoded = string.encode('ascii', 'ignore')
    if not encoded:
        encoded = '0.0'
    return encoded + ' '
Now replace each f.write(string.encode('ascii', 'ignore') + " ") with f.write(process_float_string(string)).

Also you should look into the csv library, if not pandas, as it can take care of stuff like putting spaces between things when writing csv files.
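
For instance, a minimal sketch (Python 2 like the rest of your script; the filename and columns are made up):

Python code:
import csv

# csv puts the delimiters in and quotes awkward values for you
with open('stats.csv', 'wb') as out:  # 'wb' for the Python 2 csv module
    writer = csv.writer(out)
    writer.writerow(['Name', 'FG%', '3P%'])
    writer.writerow(['Steven Adams', '.544', '.000'])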

Or... do you not even get a string when the field is empty?

SurgicalOntologist fucked around with this message at 04:55 on Jan 8, 2016

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS

SurgicalOntologist posted:

Ah, I didn't know about the stripped_strings method. That's useful. Although I'm quite certain my way would have worked, you would have had to change everything after that.

Python code:
def process_float_string(string):
    encoded = string.encode('ascii', 'ignore')
    if not encoded:
        encoded = '0.0'
    return encoded + ' '
Now replace each f.write(string.encode('ascii', 'ignore') + " ") with f.write(process_float_string(string)).

Also you should look into the csv library, if not pandas, as it can take care of stuff like putting spaces between things when writing csv files.

Or... do you not even get a string when the field is empty?

The output up there is exactly what the text file is: per36_line_2015[4] is C, [10] is .544. I'm doing a lot of calculations based on all of this stuff in another file. I need the file exactly like it is, with .000 added where it doesn't read anything in the td in the corresponding tr.
So there is the problem: see how lines 1 and 2 have values that are similar up until those 0s in the middle, then a .XXX on line 1 but not on line 2. That's position [13], so if I wanted to grab it, it should be .000, but if I run for 2016 it reads 3.9, which is offensive rebounds I think, and if I try to pull value [28], the last one (points), it won't read - out of range.

I thought BeautifulSoup would have something cool built in but I can't find it. I'm going to bed. Thanks for your help anyway, guys.

ButtWolf fucked around with this message at 05:12 on Jan 8, 2016

SurgicalOntologist
Jun 17, 2004

The idea is that if not encoded tests for an empty string, and replaces it with your placeholder value. Or is the problem that you're not getting a string at all, even an empty one? If that's the case you'll need to go back to my previous suggestion instead of the stripped_strings method. If stripped_strings simply skips empty fields (rather than giving you an empty string) then you shouldn't use it.

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS
So you think it's a string output problem, not when I'm reading it? That's where we were confused. No that doesn't seem to work. None of it works. I'm burning everything down. kbye

ButtWolf fucked around with this message at 05:15 on Jan 8, 2016

SurgicalOntologist
Jun 17, 2004

ButtWolf posted:

I need the file exactly like it is, with .000 added where it doesn't read anything in the td in the corresponding tr.

Do what you want, but you're giving yourself lots of headaches by tooling your own "write tabular data to a file" function. For example if you used commas then you'd have an easier time finding empty fields.

ButtWolf posted:

I thought BeautifulSoup would have something cool built in but I can't find it.

I don't think BeautifulSoup has anything specifically for data handling (which is really what you're asking for). You want pandas, as I suggested before.

Yes, you'll need to retool your following steps as well, but it would be relatively simple using a library designed for data.

For example:

Python code:
import sys
import pandas as pd


def fix_table(table):
    for column in table.columns:
        if column.startswith('Unnamed:'):
            del table[column]
    return table


url = 'http://www.basketball-reference.com/players/a/arizatr01.html?lid=carousel_player'

tables = pd.read_html(url)

per_36 = fix_table(tables[2])
per_36.to_csv(sys.stdout, index=False)

advanced = fix_table(tables[4])
advanced.to_csv(sys.stdout, index=False)
gives me

code:
Season,Age,Tm,Lg,Pos,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
2004-05,19.0,NYK,NBA,SF,80,12,1382,4.5,10.1,0.442,0.1,0.3,0.231,4.4,9.8,0.449,3.2,4.5,0.695,2.3,4.0,6.3,2.2,1.8,0.5,1.9,3.9,12.2
2005-06,20.0,TOT,NBA,SF,57,10,999,3.4,8.1,0.41200000000000003,0.0,0.2,0.2,3.3,8.0,0.41600000000000004,2.8,4.6,0.606,2.8,5.0,7.9,2.2,2.1,0.4,2.2,3.6,9.5
2005-06,20.0,NYK,NBA,SF,36,10,709,3.1,7.4,0.418,0.1,0.2,0.33299999999999996,3.0,7.3,0.42,2.1,3.9,0.545,2.5,4.5,7.0,2.3,2.2,0.5,2.3,4.0,8.4
2005-06,20.0,ORL,NBA,SF,21,0,290,4.0,9.9,0.4,0.0,0.2,0.0,4.0,9.7,0.41,4.3,6.2,0.7,3.6,6.5,10.1,1.9,1.7,0.2,2.0,2.7,12.3
2006-07,21.0,ORL,NBA,SG,57,7,1278,5.6,10.5,0.539,0.0,0.2,0.0,5.6,10.3,0.5489999999999999,3.0,4.8,0.62,2.8,4.3,7.0,1.8,1.7,0.5,2.4,3.8,14.3
2007-08,22.0,TOT,NBA,SF,35,3,546,4.5,9.0,0.507,0.3,1.2,0.278,4.2,7.8,0.542,3.2,4.9,0.653,2.0,5.1,7.1,2.9,2.0,0.7,1.5,3.0,12.7
2007-08,22.0,ORL,NBA,SF,11,0,115,4.4,9.7,0.452,0.0,0.9,0.0,4.4,8.8,0.5,2.5,4.7,0.5329999999999999,1.6,5.9,7.5,2.5,1.6,0.9,1.6,2.8,11.3
2007-08,22.0,LAL,NBA,SF,24,3,431,4.6,8.8,0.524,0.4,1.3,0.33299999999999996,4.2,7.5,0.556,3.4,5.0,0.6829999999999999,2.1,4.9,7.0,3.0,2.2,0.7,1.5,3.0,13.0
2008-09,23.0,LAL,NBA,SF,82,20,1999,4.9,10.7,0.46,1.1,3.4,0.319,3.8,7.3,0.526,2.2,3.0,0.71,2.0,4.3,6.3,2.6,2.5,0.4,1.6,3.0,13.1
2009-10,24.0,HOU,NBA,SF,72,71,2629,5.4,13.7,0.39399999999999996,1.9,5.6,0.33399999999999996,3.5,8.1,0.436,2.0,3.1,0.649,1.1,4.5,5.5,3.8,1.7,0.5,2.2,2.2,14.7
2010-11,25.0,NOH,NBA,SF,75,75,2600,4.2,10.6,0.39799999999999996,1.1,3.8,0.303,3.1,6.8,0.45,1.9,2.7,0.701,0.8,4.8,5.6,2.2,1.7,0.4,1.6,2.5,11.4
2011-12,26.0,NOH,NBA,SF,41,41,1350,4.5,10.7,0.41700000000000004,0.8,2.3,0.33299999999999996,3.7,8.4,0.44,2.1,2.7,0.775,1.1,4.6,5.7,3.6,1.8,0.7,2.0,1.9,11.8
2012-13,27.0,WAS,NBA,SF,56,15,1471,4.6,11.0,0.41700000000000004,1.9,5.1,0.364,2.7,5.9,0.46299999999999997,1.9,2.3,0.821,1.1,5.4,6.5,2.8,1.8,0.5,2.1,1.8,13.0
2013-14,28.0,WAS,NBA,SF,77,77,2723,5.1,11.3,0.456,2.4,5.8,0.40700000000000003,2.8,5.4,0.509,2.0,2.6,0.772,1.3,5.0,6.3,2.5,1.7,0.3,1.7,2.4,14.6
2014-15,29.0,HOU,NBA,SF,82,82,2930,4.5,11.2,0.402,2.4,6.8,0.35,2.1,4.4,0.485,1.5,1.8,0.853,0.9,4.7,5.6,2.6,1.9,0.2,1.7,2.3,12.9
2015-16,30.0,HOU,NBA,SF,35,35,1192,4.2,10.8,0.39,2.1,6.3,0.33799999999999997,2.1,4.6,0.461,1.4,1.8,0.754,0.7,4.5,5.2,2.1,1.9,0.2,1.7,2.4,12.0
Career,,,NBA,,749,448,21099,4.7,11.0,0.42700000000000005,1.4,4.1,0.34700000000000003,3.3,6.9,0.475,2.1,3.0,0.711,1.4,4.7,6.1,2.7,1.8,0.4,1.9,2.6,12.9
Season,Age,Tm,Lg,Pos,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
2004-05,19.0,NYK,NBA,SF,80,1382,13.3,0.503,0.033,0.447,7.5,13.1,10.3,9.8,2.7,1.0,13.6,17.8,0.8,1.3,2.1,0.073,-1.6,0.7,-1.0,0.4
2005-06,20.0,TOT,NBA,SF,57,999,11.8,0.46799999999999997,0.022000000000000002,0.562,9.6,17.1,13.3,9.8,3.0,0.9,18.0,15.8,-0.1,1.0,0.9,0.043,-2.1,1.5,-0.6,0.4
2005-06,20.0,NYK,NBA,SF,36,709,10.2,0.45899999999999996,0.021,0.527,8.3,15.3,11.7,10.4,3.2,1.0,20.4,14.4,-0.4,0.6,0.3,0.02,-2.3,2.0,-0.3,0.3
2005-06,20.0,ORL,NBA,SF,21,290,15.8,0.485,0.025,0.625,12.7,21.5,17.3,8.3,2.6,0.5,13.6,19.3,0.2,0.4,0.6,0.10099999999999999,-1.6,0.2,-1.4,0.0
2006-07,21.0,ORL,NBA,SG,57,1278,16.2,0.5670000000000001,0.019,0.461,9.7,14.2,12.0,8.9,2.5,1.2,15.8,19.2,1.5,1.9,3.4,0.128,0.2,1.2,1.5,1.1
2007-08,22.0,TOT,NBA,SF,35,546,16.1,0.568,0.132,0.551,6.4,15.4,11.1,11.8,2.9,1.5,12.0,15.6,0.9,0.9,1.8,0.157,0.0,1.8,1.8,0.5
2007-08,22.0,ORL,NBA,SF,11,115,12.7,0.479,0.09699999999999999,0.484,5.2,18.3,12.0,10.7,2.2,1.9,11.7,17.0,0.0,0.2,0.2,0.085,-3.4,1.9,-1.5,0.0
2007-08,22.0,LAL,NBA,SF,24,431,17.1,0.594,0.14300000000000002,0.5710000000000001,6.7,14.6,10.8,12.0,3.0,1.3,12.0,15.3,0.9,0.7,1.6,0.17600000000000002,0.9,1.8,2.7,0.5
2008-09,23.0,LAL,NBA,SF,82,1999,15.5,0.544,0.32,0.284,6.4,13.4,10.0,10.4,3.5,0.9,11.5,16.6,2.5,3.5,6.1,0.146,1.0,2.1,3.1,2.6
2009-10,24.0,HOU,NBA,SF,72,2629,13.3,0.488,0.40700000000000003,0.228,3.3,14.7,8.8,16.7,2.4,1.1,12.8,21.2,0.1,3.1,3.2,0.057999999999999996,0.0,1.3,1.3,2.2
2010-11,25.0,NOH,NBA,SF,75,2600,11.3,0.48700000000000004,0.35600000000000004,0.259,2.8,16.4,9.5,10.0,2.5,1.0,12.2,17.7,-0.4,3.8,3.4,0.062,-1.2,2.3,1.1,2.0
2011-12,26.0,NOH,NBA,SF,41,1350,14.2,0.496,0.21600000000000003,0.253,3.7,15.1,9.5,16.8,2.8,1.5,14.5,18.4,0.5,1.7,2.2,0.078,-0.3,2.2,1.8,1.3
2012-13,27.0,WAS,NBA,SF,56,1471,14.0,0.5379999999999999,0.46299999999999997,0.21100000000000002,3.3,16.9,10.0,12.7,2.5,1.1,14.6,17.9,0.7,2.4,3.1,0.102,-0.4,1.9,1.5,1.3
2013-14,28.0,WAS,NBA,SF,77,2723,15.8,0.59,0.518,0.226,4.1,16.3,10.1,10.8,2.4,0.6,12.3,17.8,4.3,3.7,8.0,0.141,2.3,1.1,3.4,3.6
2014-15,29.0,HOU,NBA,SF,82,2930,12.7,0.539,0.61,0.157,2.9,14.4,8.6,11.1,2.6,0.5,12.7,16.5,2.7,3.9,6.6,0.10800000000000001,0.6,1.3,1.9,2.9
2015-16,30.0,HOU,NBA,SF,35,1192,11.0,0.513,0.5770000000000001,0.17,2.1,13.8,7.9,9.2,2.7,0.5,12.5,16.1,0.4,0.8,1.2,0.048,-0.4,-0.1,-0.4,0.5
Career,,,NBA,,749,21099,13.7,0.525,0.374,0.26899999999999996,4.6,15.1,9.8,11.7,2.7,0.9,13.2,17.8,13.8,28.1,41.9,0.095,0.1,1.5,1.5,18.7
Now pandas also has facilities for handling missing data in your computations, so the blanks won't be an issue in your next steps. "Worst case scenario" you'd add something like .fillna(0.0).
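
For example, with a toy frame standing in for one of the tables above:

Python code:
import pandas as pd

# blank cells come through as NaN; fillna swaps in your placeholder
table = pd.DataFrame({'3P%': [0.231, None, 0.350]})
table = table.fillna(0.0)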

SurgicalOntologist
Jun 17, 2004

ButtWolf posted:

No that doesn't seem to work. None of it works. I'm burning everything down. kbye

Umm can you be more specific? What did you try, and what happened? You're impossible to help.

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS
I tried every single thing you suggested and nothing did what I needed. How am I impossible? I'm not getting mad or anything I'm lightheartedly joking and wanting to go to sleep.

SurgicalOntologist
Jun 17, 2004

Because you're not saying exactly what you did or what happened.

What you did: "what you suggested" isn't specific enough. Only my most recent post gave an entire example. For example, maybe you did something like assuming the line I gave was a drop-in replacement for an existing line in your code. And my real suggestion, "use the csv module", is not specific enough to know what you actually did.

What happened: an error? The same output as before? Something else? In either case, due diligence on your end would be to step through the code and see what values the variable in question takes at the critical moment.

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS

SurgicalOntologist posted:

Because you're not saying exactly what you did or what happened.

What you did: "what you suggested" isn't specific enough. Only my most recent post gave an entire example. For example, maybe you did something like assuming the line I gave was a drop-in replacement for an existing line in your code. And my real suggestion, "use the csv module", is not specific enough to know what you actually did.

What happened: an error? The same output as before? Something else? In either case, due diligence on your end would be to step through the code and see what values the variable in question takes at the critical moment.

Whatever. It now just feels like you are mad at me for not knowing what I'm doing. I'm sorry if I wasn't explaining myself, but I'm new and not good and I only try this crap about once a year, usually because of this. It just seems like all the time it's "I'm one little thing away from it being done and just the way I want, can I get help?" --- "Rewrite the entire thing, using this other library that you don't know about or some really advanced way of doing it." It's about 6 little files, each 40-100 lines. And this is basically all I need: one <td> field that is empty 10% of the time.

It's frustrating and I know I may have problems explaining myself, so I'm sorry. I'm just gonna go to sleep.

SurgicalOntologist
Jun 17, 2004

Fair enough, sorry I was frustrated too.

But to be fair, using some "really advanced" library or module often actually requires less effort, since the hard work has been done by someone else, as I hoped would be clear from the code I posted.

And just because it seems like "one little thing" is wrong doesn't mean that there's one corresponding little fix. In this case, my guess is that stripped_strings was skipping empty fields rather than giving you an empty string. That requires doing it another way than stripped_strings, so already there's going to be some work involved.

I suggested another way, but that involved using more BeautifulSoup finds. Since these can have issues depending on the page (i.e. you often need to add some specifiers to get just the elements you want), there is a trial-and-error process involved. Doing this trial-and-error in the forums is pretty much impossible, so here you go, I did it for you. It turned out that getting all the trs gets you too many, so I needed to specify the class. Otherwise this is pretty much what I was suggesting before.

Python code:
from bs4 import BeautifulSoup
import requests


def process_table(table_element):
    lines = []
    for tr in table_element.find_all('tr', attrs={'class': 'full_table'}):
        lines.append([td.string or '.000' for td in tr.find_all('td')])
    return lines


def write_table(file, table_list):
    file.writelines(' '.join(line) + '\n' for line in table_list)


url = 'http://www.basketball-reference.com/players/a/arizatr01.html?lid=carousel_player'
output_path = 'output.txt'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

with open(output_path, 'w') as f:
    write_table(f, process_table(soup.find('table', id='per_minute')))
    write_table(f, process_table(soup.find('table', id='advanced')))

I didn't feel like troubleshooting all that encoding BS so I just used Python 3. Sorry if you can't use it too, but if not you can probably stick the logic you were using into write_table.

SurgicalOntologist fucked around with this message at 05:49 on Jan 8, 2016

SurgicalOntologist
Jun 17, 2004

And to atone for being snappy (and to procrastinate on my own work) I attempted to improve the readability a bit, so you could understand what I did. This should also make it easier to add back in your logic for handling multiple urls.

Python code:
from bs4 import BeautifulSoup
import requests


def tables_from_url(url):
    soup = BeautifulSoup(requests.get(url).text, 'lxml')

    name = soup.find('h1').string
    print('reading tables for {}...'.format(name), end='')

    table_ids = ['per_minute', 'advanced']
    tables = [read_table(soup.find('table', id=table_id)) for table_id in table_ids]

    print('done.')
    return tables


def read_table(table_element):
    return [read_line(line) for line in table_element.find_all('tr', attrs={'class': 'full_table'})]


def read_line(line_element):
    return [read_field(field) for field in line_element.find_all('td')]


def read_field(field_element):
    return field_element.string or '.000'


def write_table(file, table_list):
    file.writelines(' '.join(line) + '\n' for line in table_list)


url = 'http://www.basketball-reference.com/players/a/arizatr01.html?lid=carousel_player'
output_path = 'output.txt'

with open(output_path, 'w') as f:
    for table in tables_from_url(url):
        write_table(f, table)

SurgicalOntologist fucked around with this message at 06:12 on Jan 8, 2016

floppo
Aug 24, 2005
Another BeautifulSoup type question:

I am scraping links from an html page that has a 'pager'. There is some javascript at the end of the html file that seems to govern what the pager does, but basically it goes through the pages of results without updating the web address in the browser, and I can't figure out how to automatically get the updated html that results from going to the next page. Here is the pager html when I'm on page 8 of the results, for example:

<div class="pager">
<span class="current">
8 / 370 page, 7384 results found total
</span>
</div>

</div>
<div class="pager form">
<a href="#" onclick="setPage('7');return false;" class="prev ">R<span>‹</span></a>
<a href="#" onclick="setPage('9');return false;" class="next ">R<span>›</span></a>
</div>

</div>
</div>

Can BS handle this somehow?

Kuule hain nussivan
Nov 27, 2008

floppo posted:

Another BeautifulSoup type question:

I am scraping links from an html page that has a 'pager'. There is some javascript at the end of the html file that seems to govern what the pager does, but basically it goes through the pages of results without updating the web address in the browser, and I can't figure out how to automatically get the updated html that results from going to the next page. Here is the pager html when I'm on page 8 of the results, for example:

<div class="pager">
<span class="current">
8 / 370 page, 7384 results found total
</span>
</div>

</div>
<div class="pager form">
<a href="#" onclick="setPage('7');return false;" class="prev ">R<span>‹</span></a>
<a href="#" onclick="setPage('9');return false;" class="next ">R<span>›</span></a>
</div>

</div>
</div>

Can BS handle this somehow?
I think the common way of doing this is using Selenium to interact with the page and BS to parse the result.
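
A minimal sketch of that split (Selenium-2-era API; the URL and the pager selectors are assumptions based on the html above):

Python code:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://example.com/results')  # hypothetical results page

links = []
while True:
    # BS parses whatever the browser is currently showing
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    links.extend(a['href'] for a in soup.find_all('a', href=True))

    # Selenium drives the javascript pager
    next_links = driver.find_elements_by_css_selector('div.pager.form a.next')
    if not next_links:
        break  # no next link means this is the last page
    next_links[0].click()  # fires setPage(...)
    time.sleep(1)  # crude wait for the new results to render

driver.quit()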

floppo
Aug 24, 2005
thanks - Selenium is working very well

pmchem
Jan 22, 2010


There's a general problem with python when using it for jobs with large core counts on HPC machines: the file system performance hit on importing modules is quite significant. One approach to solving this problem is here:
https://github.com/rainwoodman/python-mpi-bcast

There are other approaches. Has anyone else in this thread had to deal with this problem? What is your preferred (and preferably low effort) solution?
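
The core trick behind that repo is "touch the filesystem once, then broadcast". A stripped-down illustration of the pattern with mpi4py (the bundle filename is hypothetical; python-mpi-bcast itself also handles the bundling and sys.path setup):

Python code:
from mpi4py import MPI

comm = MPI.COMM_WORLD

# only rank 0 hits the shared filesystem; every other rank receives the
# bytes over the interconnect instead of hammering the metadata server
payload = None
if comm.Get_rank() == 0:
    with open('packages.tar.gz', 'rb') as f:
        payload = f.read()
payload = comm.bcast(payload, root=0)
# each rank would then unpack payload to node-local storage (e.g. /tmp)
# and prepend that directory to sys.path before the heavy imports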

Highblood
May 20, 2012

Let's talk about tactics.
Stupid question incoming: big newbie to programming here. What's the point of dictionaries? I'm used to the more C-like struct, and they seem similar, but I don't get the differences. Why assign values to strings? Why not just a list of variables (man, reaching for that " key is really hard okay)? From what I understand you can add and remove entries from the dictionary whenever you want, but the applications for that fly completely over my head (I'm dumb, sorry).

I just feel like whatever I could do with that could just as easily be done without it (C has no associative arrays, right?)

Hopefully nobody got an aneurysm from that question.

Asymmetrikon
Oct 30, 2009

I believe you're a big dork!
Being able to have a structure hold a runtime-variable number of entries is really important to a lot of use cases - what if the user needs to be able to add more things to your "list of variables" on the fly, like if you have some kind of list program? What if you don't necessarily know the exact form of the struct you're going to receive because of possible errors or unforeseen problems, like doing interaction over the web via JSON? Dictionaries are trivially searchable by their key, which is another neat feature that is more inconvenient without them.

A feature not being in C isn't particularly telling, either; C doesn't have a lot of things built-in most modern languages have, like lists, try/catch, generic types, lambdas, first-class functions, etc. We still like to have those things.
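
The JSON case is a nice concrete one: you get back a dict whose keys you only discover at runtime. A tiny made-up example:

Python code:
import json

# a web response whose exact fields you can't know when writing the code
payload = json.loads('{"name": "goon", "score": 9.95, "extra": null}')
for key, value in payload.items():
    print(key, value)  # iterate over whatever actually arrived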

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

Highblood posted:

Stupid question incoming: Big newbie to programming here. What's the point of dictionaries? I'm used to the more C-like struct [...]

Well those two (dictionaries and C structs) are fairly different things. The members of a struct are hard coded whereas you can put what you want into a dictionary at runtime. Your question is a bit like asking what's the use of Lists, or of vector from C++; what's the point when I can just declare an array big enough to hold what I want? The answer to that is another question: what if you don't know how big you want it to be?

Going back to the dictionary example, maybe you not only don't know how many items you need but also what "identity" (key) those items have. The keys will generally have a meaningful identity which you may or may not know when you write the code. Consider for example that you might have a dictionary that maps names of countries to their populations. You don't know up front what countries might be represented, and it wouldn't be practical to just list all possible countries, not least because your list might go out of date.

By the way, in Python more than in many other languages, the line between a class (the nearest thing Python has to a C struct) and a dictionary is a bit blurred, but that's something that you only learn about if you get intimately familiar with the language.
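
The country example in a few lines of hypothetical code:

Python code:
# keys and values arrive at runtime, e.g. parsed from a census file
populations = {}
for name, count in [('Monaco', 38695), ('Fiji', 905502)]:  # made-up data
    populations[name] = count
print(populations['Monaco'])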

Cingulate
Oct 23, 2012

by Fluffdaddy
Dicts were the feature that convinced me to learn Python. I've told my Python class that my personal 1st rule of Python is: "when in doubt, it's probably best to use a list comprehension" and my personal 2nd rule of Python is: "when still in doubt, it's probably best to use a dict".

Related question: is that bad?

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

Cingulate posted:

Dicts were the feature that convinced me to learn Python. I've told my Python class that my personal 1st rule of Python is: "when in doubt, it's probably best to use a list comprehension" and my personal 2nd rule of Python is: "when still in doubt, it's probably best to use a dict".

Related question: is that bad?

When do you break out the dictionary comprehensions?

SurgicalOntologist
Jun 17, 2004

If you're teaching scientific Python I could see that as being counterproductive advice (specifically the list comprehension adage). I would hate to see sums = [a + b for a, b in zip(array1, array2)] in scientific code.

And yeah I was going to say the same re: dict comprehensions. Those aphorisms are vague, but then again they're just aphorisms and presumably they complement some instruction so whatever. But I would say something like "when in doubt, use core data structures (usually a list, tuple, or dictionary)" and "use comprehensions to construct lists, dicts, and sets when possible".
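
For the record, a dict (or set) comprehension is the same idea with braces (toy example):

Python code:
words = ['per_minute', 'advanced', 'per_game']
lengths = {w: len(w) for w in words}  # dict comprehension
initials = {w[0] for w in words}      # set comprehension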

Symbolic Butt
Mar 22, 2009

(_!_)
Buglord

SurgicalOntologist posted:

I would hate to see sums = [a + b for a, b in zip(array1, array2)] in scientific code.

lol that's me, I write this kind of code all the time... :eng99:

SurgicalOntologist
Jun 17, 2004

Break the habit! sums = array1 + array2 is easier in every way.
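
(That's elementwise addition on numpy arrays, as the next few posts sort out. A minimal example:)

Python code:
import numpy as np

array1 = np.array([5.0, 10.0, 15.0])
array2 = np.array([2.0, 4.0, 6.0])
sums = array1 + array2  # array([ 7., 14., 21.]), elementwise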

Jewel
May 2, 2009

:confused: But python doesn't have that syntax natively.

Python2:

>>> arr1 = [5, 10, 15]
>>> arr2 = [2, 4, 6]
>>> print arr1 + arr2
[5, 10, 15, 2, 4, 6]

Python 3 gives the same result.

Are you sure you're not just expecting numpy or scipy or something similar to be installed by default?

SurgicalOntologist
Jun 17, 2004

Right, that's what I was implying with "if you're teaching scientific Python" as well as by using "array" in the variable names, sorry for not being clear.

If you're not using numpy arrays, then whatever. I was talking about using numpy/scipy for certain things but still using list comprehensions for basic operations, which is something I see quite a lot.

Cingulate
Oct 23, 2012

by Fluffdaddy

Hammerite posted:

When do you break out the dictionary comprehensions?
As SO implied - in truth, I'm of course actually saying "1st rule: when in doubt, do a list comprehension parentheses open or a dict comprehension which I'll explain later parentheses close".

SurgicalOntologist posted:

If you're teaching scientific Python I could see that as being counterproductive advice (specifically the list comprehension adage). I would hate to see `sums = [a + b for a, b in zip(array1, array2)] in scientific code.

And yeah I was going to say the same re: dict comprehensions. Those aphorisms are vague, but then again they're just aphorisms and presumably they complement some instruction so whatever. But I would say something like "when in doubt, use core data structures (usually a list, tuple, or dictionary)" and "use comprehensions to construct lists, dicts, and sets when possible".

Symbolic Butt posted:

lol that's me, I write this kind of code all the time... :eng99:

SurgicalOntologist posted:

Break the habit! sums = array1 + array2 is easier in every way.
c = a + b is indeed nicer, but I like the universality of list comps. See:

df["field_1"] = [do_something_with(x, y) for x, y in zip(iterable1, iterable2)]
df["field_2"] = [x + y for x, y in zip(iterable1, iterable2)]

Of course,

df["field_2"] = iterable1 + iterable2

is better. But the point is, in a non-toy case where you don't know the equivalent of iterable1 + iterable2 (or whether there even is one), you can always fall back on the comp, and then later, when you make the code nice, replace your 145 comps with good code.

Cingulate fucked around with this message at 16:45 on Jan 11, 2016

Cingulate
Oct 23, 2012

by Fluffdaddy

Jewel posted:

:confused: But python doesn't have that syntax natively.

Python2:

>>> arr1 = [5, 10, 15]
>>> arr2 = [2, 4, 6]
>>> print arr1 + arr2
[5, 10, 15, 2, 4, 6]

Python 3 gives the same result.

Are you sure you're not just expecting numpy or scipy or something similar to be installed by default?
As SO said, it's scientific python.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Jewel posted:

:confused: But python doesn't have that syntax natively.

Python2:

>>> arr1 = [5, 10, 15]
>>> arr2 = [2, 4, 6]
>>> print arr1 + arr2
[5, 10, 15, 2, 4, 6]

Python 3 gives the same result.

Are you sure you're not just expecting numpy or scipy or something similar to be installed by default?

those are lists, not arrays, jewel

SurgicalOntologist
Jun 17, 2004

Your field2 example just hides the issue inside a function, but it's the same thing.

But anyway, I don't disagree. I basically have two mindsets when coding in Python. Either I'm working with arrays, in which case I vectorize everything and include foo = np.asarray(foo) for public API functions. Or, I'm working with generators and iterables everywhere. For me I'm usually clearly in one camp or the other but if there was a more murky situation, yeah, I would stick to iterables.
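
That asarray pattern in miniature (a hypothetical public function):

Python code:
import numpy as np

def normalize(values):
    values = np.asarray(values, dtype=float)  # accept lists or arrays at the boundary
    return values / values.sum()

print(normalize([1, 2, 5]))  # works on a plain list too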

Cingulate
Oct 23, 2012

by Fluffdaddy

SurgicalOntologist posted:

Your field2 example just hides the issue inside a function, but it's the same thing.

But anyway, I don't disagree. I basically have two mindsets when coding in Python. Either I'm working with arrays, in which case I vectorize everything and include foo = np.asarray(foo) for public API functions. Or, I'm working with generators and iterables everywhere. For me I'm usually clearly in one camp or the other but if there was a more murky situation, yeah, I would stick to iterables.
The point is, comps are fairly universal; one form across almost all usage cases. That helps if I'm teaching beginners.

This is in the realistic situation where you're primarily a scientist and don't really care about programming, and your code starts as a prototype and never really stops being a prototype.

SurgicalOntologist
Jun 17, 2004

Cingulate posted:

The point is, comps are fairly universal; one form across almost all usage cases. That helps if I'm teaching beginners.
Yeah, fair enough, and I agree with this as far as what to teach first. My issue is with making "if in doubt, use list comprehensions" your #1 maxim assuming you'll later teach "but in certain situations [when you have a numpy array] you may prefer a vectorized operation."

Unless you're not going to be using numpy at all, then whatever.

Cingulate posted:

This is in the realistic situation where you're primarily a scientist and don't really care about programming, and your code starts as a prototype and never really stops being a prototype.
For me this makes me more likely to code it the vectorized way first, because it's easier to code and if it's a prototype I'm not worrying about "what if they pass an iterable." This seems logical to me: with longer-term maintained code for wider use, use the more general form. When you're just whipping something together, use the specific form if it's easier to write.

Cingulate
Oct 23, 2012

by Fluffdaddy

SurgicalOntologist posted:

Yeah, fair enough, and I agree with this as far as what to teach first. My issue is with making "if in doubt, use list comprehensions" your #1 maxim assuming you'll later teach "but in certain situations [when you have a numpy array] you may prefer a vectorized operation."
The maxim is "when in doubt, ..." after all; that is, if you can't think of how to solve a problem, you probably can find a way if you think about using comprehensions.

SurgicalOntologist posted:

Unless you're not going to be using numpy at all, then whatever.
Well, mostly pandas, but you can also do the equivalent thing in pandas of course (df["cond1"] + df["cond2"]). And that's certainly something I'm trying to convey.

SurgicalOntologist posted:

For me this makes me more likely to code it the vectorized way first, because it's easier to code and if it's a prototype I'm not worrying about "what if they pass an iterable." This seems logical to me: with longer-term maintained code for wider use, use the more general form. When you're just whipping something together, use the specific form if it's easier to write.
The thing is: in a non-toy case, where it won't be as simple as having a very transparent and intuitive shorthand (such as c = a + b) at hand, I don't want them to be frustrated, but to think about the problem in a structured way. Like: you know you can do this. There probably is a way of doing it with comprehensions or dicts. Just think.

And if it's too slow/long, there probably is a better way.

SurgicalOntologist
Jun 17, 2004

Okay, fair enough, sounds reasonable. Sorry for splitting hairs.

Cingulate
Oct 23, 2012

by Fluffdaddy

SurgicalOntologist posted:

Okay, fair enough, sounds reasonable. Sorry for splitting hairs.
No no, I asked the question. I guess I've "successfully defended" myself here, but I'll definitely think about this the next time I teach beginners (= people even worse at Python than I am).

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

Suspicious Dish posted:

those are lists, not arrays, jewel

In the context of general Python programming, I would interpret "array" as synonymous with "list", since lists are the closest match for the notion of array in the Python standard library. In the context of using Python with Numpy, there might be an argument that this no longer holds; but as discussed, Numpy was at best implicitly introduced to the discussion context.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Hammerite posted:

In the context of general Python programming, I would interpret "array" as synonymous with "list", since lists are the closest match for the notion of array in the Python standard library.

you forgot about https://docs.python.org/2/library/array.html

as you should
