|
This is probably some really lovely inelegant coding, but I would do something like this: code:
|
# ? Jan 8, 2016 04:30 |
|
I'm not having a problem getting the data. Just replacing it if it's blank. The html file is 4000 lines long. It seems like it should be easy. Nothing is working. I can't use findall td, that's like 1000 things I don't need. ButtWolf fucked around with this message at 04:35 on Jan 8, 2016 |
# ? Jan 8, 2016 04:32 |
|
ButtWolf posted: I'm not having a problem getting the data. Just replacing it if it's blank. The html file is 4000 lines long. It seems like it should be easy. Nothing is working.

What about my suggestion, find('tr').find_all('td')? (See my previous post for details.) Anyway, post the rest of your code. How are you getting strings out of these objects to begin with? Once we see that it will be clearer how to help you.
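The reason walking the cells row by row helps is that every td produces an entry, even an empty one. The sketch below is not the code from the thread: it uses the standard library's html.parser as a stand-in for BeautifulSoup's row and cell lookups so that it runs without third-party packages. The class name, the sample table, and the player values are all made up for illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>, row by row, keeping empty cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the <tr> currently being parsed
        self._cell = None   # text pieces of the <td> currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # An empty <td></td> still appends "" here, so columns stay aligned.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample: the third cell of the second row is empty.
SAMPLE_HTML = """
<table>
  <tr><td>Player A</td><td>C</td><td>.544</td><td>3.9</td></tr>
  <tr><td>Player B</td><td>PG</td><td></td><td>1.2</td></tr>
</table>
"""

collector = RowCollector()
collector.feed(SAMPLE_HTML)
rows = collector.rows
```

Both rows come back with four entries, so a fixed index such as rows[1][2] still points at the same column even when the cell was blank on the page.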
|
# ? Jan 8, 2016 04:37 |
|
code:
The bold .000 is what I have to add, and what I'm trying to get it to add when the td is empty. ButtWolf fucked around with this message at 04:53 on Jan 8, 2016 |
# ? Jan 8, 2016 04:47 |
|
Ah, I didn't know about the stripped_strings method. That's useful. Although I'm quite certain my way would have worked, you would have had to change everything after that. Python code:
Also you should look into the csv library, if not pandas, as it can take care of stuff like putting spaces between things when writing csv files. Or... do you not even get a string when the field is empty? SurgicalOntologist fucked around with this message at 04:55 on Jan 8, 2016 |
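As a rough illustration of the csv suggestion, here is a minimal sketch using only the standard library. The rows are invented sample data, and an in-memory buffer stands in for a real file; with a real file you would open it with newline="" as the csv documentation recommends.

```python
import csv
import io

# Hypothetical rows; the empty string marks a stat that was missing on the page.
rows = [
    ["Player A", "C", ".544", "3.9"],
    ["Player B", "PG", "", "1.2"],
]

buffer = io.StringIO()              # stands in for a file opened with newline=""
csv.writer(buffer).writerows(rows)  # the writer inserts the delimiters for you
text = buffer.getvalue()

# Reading back yields the same number of fields per row, empty ones included.
parsed = list(csv.reader(io.StringIO(text)))
```

Because an empty field is written as two adjacent commas, it survives a round trip instead of silently shifting the later columns.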
# ? Jan 8, 2016 04:52 |
|
SurgicalOntologist posted: Ah, I didn't know about the stripped_strings method. That's useful. Although I'm quite certain my way would have worked, you would have had to change everything after that.

The output up there is exactly what the text file is. per36_line_2015[4] is C, [10] is .544. I'm doing a lot of calculations based on all of this stuff in another file. I need the file exactly like it is, with .000 added where it doesn't read anything in the td in the corresponding tr. So there is the problem: see how lines 1 and 2 have values that are similar up until those 0s in the middle, then a .XXX on line 1 but not on line 2. That's position [13], so if I wanted to grab it, it should be 000, but if I run for 2016, it reads 3.9, which is offensive rebounds I think, and if I try to pull value [28], the last one (points), it won't read: out of range. I thought BeautifulSoup would have something cool built in but I can't find it. I'm going to bed. Thanks for your help anyway guys. ButtWolf fucked around with this message at 05:12 on Jan 8, 2016 |
# ? Jan 8, 2016 04:59 |
|
The idea is if not encoded tests for an empty string, and replaces it with your placeholder value. Or is the problem that you're not getting a string at all, even an empty one? If that's the case you'll need to go back to my previous suggestion instead of the stripped_strings method. If stripped_strings simply skips empty fields (rather than giving you an empty string) then you shouldn't use it.
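The distinction between "an empty string" and "no string at all" can be sketched in plain Python. This is not the code posted in the thread; the function name, the placeholder constant, and the sample rows are assumptions made for the example.

```python
PLACEHOLDER = ".000"  # value the poster wants in place of a blank cell


def fill_blanks(cells, placeholder=PLACEHOLDER):
    """Return a copy of cells with empty or whitespace-only strings replaced."""
    return [cell if cell.strip() else placeholder for cell in cells]


# Case 1: the blank cell arrives as an empty string, so it can be replaced.
with_empty = ["Player B", "PG", "", "1.2"]
filled = fill_blanks(with_empty)

# Case 2: the extraction step skipped the blank cell entirely. There is
# nothing to replace, and every later column has shifted left by one.
skipped = ["Player B", "PG", "1.2"]
unchanged = fill_blanks(skipped)
```

In the second case no amount of post-processing can tell which column went missing, which is why the fix has to happen at the extraction step.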
|
# ? Jan 8, 2016 05:09 |
|
So you think it's a string output problem, not when I'm reading it? That's where we were confused. No that doesn't seem to work. None of it works. I'm burning everything down. kbye
ButtWolf fucked around with this message at 05:15 on Jan 8, 2016 |
# ? Jan 8, 2016 05:13 |
|
ButtWolf posted: I need the file exactly like it is, with .000 added where it doesn't read anything in the td in the corresponding tr.

Do what you want, but you're giving yourself lots of headaches by tooling your own "write tabular data to a file" function. For example if you used commas then you'd have an easier time finding empty fields.

ButtWolf posted: I thought BeautifulSoup would have something cool built in but I can't find it.

I don't think BeautifulSoup has anything specifically for data handling (which is really what you're asking for). You want pandas, as I suggested before. Yes, you'll need to retool your following steps as well, but it would be relatively simple using a library designed for data. For example: Python code:
code:
|
# ? Jan 8, 2016 05:15 |
|
ButtWolf posted: No that doesn't seem to work. None of it works. I'm burning everything down. kbye

Umm can you be more specific? What did you try, and what happened? You're impossible to help.
|
# ? Jan 8, 2016 05:17 |
|
I tried every single thing you suggested and nothing did what I needed. How am I impossible? I'm not getting mad or anything I'm lightheartedly joking and wanting to go to sleep.
|
# ? Jan 8, 2016 05:19 |
|
Because you're not saying exactly what you did or what happened. What you did: "what you suggested" isn't specific enough. Only my most recent post gave an entire example. For example maybe you did something like assuming the line I gave was a drop in replacement for an existing line in your code. And my real suggestion "use the CSV module" is not specific enough to know what you actually yield. What happened: error? same output as before? something else? In either case due diligence on your end would be to step through the code and see what values the variable in question is taking at the critical moment.
|
# ? Jan 8, 2016 05:23 |
|
SurgicalOntologist posted: Because you're not saying exactly what you did or what happened.

Whatever. It now just feels like you are mad at me for not knowing what I'm doing. I'm sorry if I wasn't explaining myself, but I'm new and not good and I only try this crap about once a year, usually because of this. It just seems like all the time, it's "I'm one little thing away from it being done and just the way I want, can I get help?" --- "Rewrite the entire thing, using this other library that you don't know about or some really advanced way of doing it." It's about 6 little files, each 40-100 lines. And this is basically all I need. One <td> field that is empty 10% of the time. It's frustrating and I know I may have problems explaining myself, so I'm sorry. I'm just gonna go to sleep.
|
# ? Jan 8, 2016 05:33 |
|
Fair enough, sorry I was frustrated too. But to be fair, using some "really advanced" library or module often actually requires less effort, since the hard work has been done by someone else. As I hoped would be clear from the code I posted. And just because it seems like "one little thing" is wrong, doesn't mean that there's one corresponding little fix. In this case, my guess is that stripped_strings was skipping empty fields rather than giving you an empty string. This required doing it another way than stripped_strings. So already there's going to be some work involved. I suggested another way, but that involved using more BeautifulSoup finds. Since these can have issues depending on the page (i.e. you often need to add some specifiers to get just the elements you want) there is a trial-and-error process involved. Doing this trial-and-error in the forums is pretty much impossible, so here you go, I did it for you. It turned out that getting all the trs gets you too many, so I needed to specify the class. Otherwise this is pretty much what I was suggesting before. Python code:
SurgicalOntologist fucked around with this message at 05:49 on Jan 8, 2016 |
# ? Jan 8, 2016 05:39 |
|
And to atone for being snappy (and to procrastinate on my own work) I attempted to improve the readability a bit, so you could understand what I did. This should also make it easier to add back in your logic for handling multiple urls. Python code:
SurgicalOntologist fucked around with this message at 06:12 on Jan 8, 2016 |
# ? Jan 8, 2016 06:07 |
|
Another BeautifulSoup type question: I am scraping links from an html that has a 'pager'. There is some javascript at the end of the html file that seems to govern what the pager does, but basically it goes through the pages of the results without updating the web address in the browser, and I can't figure out how to get the updated html file resulting from going to the next page in an automatic way. Here is the pager in html when I'm on page 8 of the results for example:

<div class="pager">
<span class="current"> 8 / 370 page, 7384 results found total </span>
</div>
</div>
<div class="pager form">
<a href="#" onclick="setPage('7');return false;" class="prev ">R<span>‹</span></a>
<a href="#" onclick="setPage('9');return false;" class="next ">R<span>›</span></a>
</div>
</div>
</div>

Can BS handle this somehow?
|
# ? Jan 8, 2016 10:01 |
|
floppo posted:Another BeautifulSoup type question:
|
# ? Jan 8, 2016 10:51 |
|
thanks - Selenium is working very well
|
# ? Jan 8, 2016 15:49 |
|
There's a general problem with python when using it for jobs with large core counts on HPC machines: the file system performance hit on importing modules is quite significant. One approach to solving this problem is here: https://github.com/rainwoodman/python-mpi-bcast There are other approaches. Has anyone else in this thread had to deal with this problem? What is your preferred (and preferably low effort) solution?
|
# ? Jan 9, 2016 18:44 |
|
Stupid question incoming: Big newbie to programming here. What's the point of dictionaries? I'm used to the more C-like struct and they seem similar but I don't get the differences. Why assign values to strings? Why not just a list of variables (man reaching for that " key is really hard okay)? From what I understand you can add and remove entries to the dictionary whenever you want but the applications for that fly completely over my head (I'm dumb sorry) I just feel like whatever I could do with that could just as easily be done without it (C has no associative arrays right?) Hopefully nobody got an aneurysm from that question.
|
# ? Jan 11, 2016 06:33 |
|
Being able to have a structure hold a runtime-variable amount of entries is really important to a lot of use cases - what if the user needs to be able to add more things to your "list of variables" on the fly, like if you have some kind of list program? What if you don't necessarily know the exact form of the struct you're going to receive because of possible errors or unforeseen problems, like doing interaction over the web via JSON? Dictionaries are trivially searchable by their key, which is another neat feature that is more inconvenient without them. A feature not being in C isn't particularly telling, either; C doesn't have a lot of things built-in most modern languages have, like lists, try/catch, generic types, lambdas, first-class functions, etc. We still like to have those things.
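To make the JSON point concrete, here is a small sketch with an invented payload in which one record is missing a key. The field names and addresses are hypothetical; the json calls are from the standard library.

```python
import json

# Hypothetical payload; the second record lacks the "email" key.
payload = '[{"name": "Ada", "email": "ada@example.com"}, {"name": "Bob"}]'
records = json.loads(payload)  # a list of dicts

# dict.get lets the code cope with a key that may or may not be present.
emails = [record.get("email", "unknown") for record in records]

# Keys can also be added at runtime, which a fixed struct cannot do.
records[1]["email"] = "bob@example.com"
has_email = "email" in records[1]
```

A C struct would need every possible field declared up front; here the shape of each record is only discovered when the data arrives.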
|
# ? Jan 11, 2016 06:53 |
|
Highblood posted: Stupid question incoming: Big newbie to programming here. What's the point of dictionaries? I'm used to the more C-like struct [...]

Well those two (dictionaries and C structs) are fairly different things. The members of a struct are hard coded whereas you can put what you want into a dictionary at runtime. Your question is a bit like asking what's the use of Lists, or of vector from C++; what's the point when I can just declare an array big enough to hold what I want? The answer to that is another question: what if you don't know how big you want it to be?

Going back to the dictionary example, maybe you not only don't know how many items you need but also what "identity" (key) those items have. The keys will generally have a meaningful identity which you may or may not know when you write the code. Consider for example that you might have a dictionary that maps names of countries to their populations. You don't know up front what countries might be represented, and it wouldn't be practical to just list all possible countries, not least because your list might go out of date.

By the way, in Python more than in many other languages, the line between a class (the nearest thing Python has to a C struct) and a dictionary is a bit blurred, but that's something that you only learn about if you get intimately familiar with the language.
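The country example might look something like the sketch below. The population figures are rounded placeholders chosen for illustration, not reference data, and the Country class is a hypothetical name used only to show how an instance's attributes are themselves stored in a dictionary.

```python
# Hypothetical, rounded figures purely for illustration.
populations = {"Iceland": 330_000, "Malta": 430_000}

populations["Fiji"] = 890_000        # add a key nobody planned for
del populations["Malta"]             # remove one at runtime
known = populations.get("Atlantis")  # missing keys can be handled without an error
total = sum(populations.values())


class Country:
    """The nearest thing to a C struct: fixed attribute names set in __init__."""

    def __init__(self, name, population):
        self.name = name
        self.population = population


# vars() exposes the instance's attribute dictionary, which is the blurred
# line between classes and dictionaries mentioned above.
attrs = vars(Country("Fiji", 890_000))
```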
|
# ? Jan 11, 2016 10:41 |
|
Dicts were the feature that convinced me to learn Python. Related question: I've told my Python class that my personal 1st rule of Python is: "when in doubt, it's probably best to use a list comprehension" and my personal 2nd rule of Python is: "when still in doubt, it's probably best to use a dict". Is that bad?
|
# ? Jan 11, 2016 14:41 |
|
Cingulate posted: Dicts were the feature that convinced me to learn Python. I've told my Python class that my personal 1st rule of Python is: "when in doubt, it's probably best to use a list comprehension" and my personal 2nd rule of Python is: "when still in doubt, it's probably best to use a dict".

When do you break out the dictionary comprehensions?
|
# ? Jan 11, 2016 14:57 |
|
If you're teaching scientific Python I could see that as being counterproductive advice (specifically the list comprehension adage). I would hate to see `sums = [a + b for a, b in zip(array1, array2)]` in scientific code. And yeah I was going to say the same re: dict comprehensions. Those aphorisms are vague, but then again they're just aphorisms and presumably they complement some instruction so whatever. But I would say something like "when in doubt, use core data structures (usually a list, tuple, or dictionary)" and "use comprehensions to construct lists, dicts, and sets when possible".
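For readers who have only met the list form, the three comprehension flavours mentioned here share one shape. The word list is an arbitrary sample.

```python
words = ["tr", "td", "table", "td"]

lengths = [len(w) for w in words]            # list comprehension keeps order and duplicates
length_by_word = {w: len(w) for w in words}  # dict comprehension: duplicate keys collapse
unique_words = {w for w in words}            # set comprehension: duplicates collapse
```

The only syntactic difference is the bracket type and, for the dict, the key: value pair in front.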
|
# ? Jan 11, 2016 15:05 |
|
SurgicalOntologist posted: I would hate to see `sums = [a + b for a, b in zip(array1, array2)]` in scientific code.

lol that's me, I write this kind of code all the time...
|
# ? Jan 11, 2016 16:00 |
|
Break the habit! sums = array1 + array2 is easier in every way.
|
# ? Jan 11, 2016 16:10 |
|
But python doesn't have that syntax natively. Python2:

>>> arr1 = [5, 10, 15]
>>> arr2 = [2, 4, 6]
>>> print arr1 + arr2
[5, 10, 15, 2, 4, 6]

Python3 is the same result. Are you sure you're not just expecting numpy or scipy or something similar to be installed by default?
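The two behaviours being contrasted can be shown side by side with plain lists, written for Python 3. With built-in lists, + concatenates; the element-wise sum needs the zip comprehension (or a NumPy array, where + is element-wise, which is not shown here to keep the sketch dependency-free).

```python
arr1 = [5, 10, 15]
arr2 = [2, 4, 6]

# On built-in lists, + joins the two sequences end to end.
concatenated = arr1 + arr2

# The element-wise sum has to be spelled out for plain lists.
elementwise = [a + b for a, b in zip(arr1, arr2)]
```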
|
# ? Jan 11, 2016 16:18 |
|
Right, that's what I was implying with "if you're teaching scientific Python" as well as by using "array" in the variable names, sorry for not being clear. If you're not using numpy arrays, then whatever. I was talking about using numpy/scipy for certain things but still using list comprehensions for basic operations, which is something I see quite a lot.
|
# ? Jan 11, 2016 16:22 |
|
Hammerite posted: When do you break out the dictionary comprehensions?
SurgicalOntologist posted: If you're teaching scientific Python I could see that as being counterproductive advice (specifically the list comprehension adage). I would hate to see `sums = [a + b for a, b in zip(array1, array2)]` in scientific code.
Symbolic Butt posted: lol that's me, I write this kind of code all the time...
SurgicalOntologist posted: Break the habit! sums = array1 + array2 is easier in every way.

df["field_1"] = [do_something_with(x, y) for x, y in zip(iterable1, iterable2)]
df["field_2"] = [x + y for x, y in zip(iterable1, iterable2)]

Of course, df["field_2"] = iterable1 + iterable2 is better. But the point is: if, in a non-toy case, you don't know the equivalent of iterable1 + iterable2, or whether one even exists, you can always fall back on the comp, and then later, when you make the code nice, replace your 145 comps with good code. Cingulate fucked around with this message at 16:45 on Jan 11, 2016 |
# ? Jan 11, 2016 16:39 |
|
Jewel posted:But python doesn't have that syntax natively.
|
# ? Jan 11, 2016 16:40 |
|
Jewel posted: But python doesn't have that syntax natively.

those are lists, not arrays, jewel
|
# ? Jan 11, 2016 16:45 |
|
Your field2 example just hides the issue inside a function, but it's the same thing. But anyway, I don't disagree. I basically have two mindsets when coding in Python. Either I'm working with arrays, in which case I vectorize everything and include foo = np.asarray(foo) for public API functions. Or, I'm working with generators and iterables everywhere. For me I'm usually clearly in one camp or the other but if there was a more murky situation, yeah, I would stick to iterables.
|
# ? Jan 11, 2016 16:45 |
|
SurgicalOntologist posted:Your field2 example just hides the issue inside a function, but it's the same thing. This is in the realistic situation where you're primarily a scientist and don't really care about programming, and your code starts as a prototype and never really stops being a prototype.
|
# ? Jan 11, 2016 16:53 |
|
Cingulate posted:The point is, comps are fairly universal; one form across almost all usage cases. That helps if I'm teaching beginners. Unless you're not going to be using numpy at all, then whatever. Cingulate posted:This is in the realistic situation where you're primarily a scientist and don't really care about programming, and your code starts as a prototype and never really stops being a prototype.
|
# ? Jan 11, 2016 17:21 |
|
SurgicalOntologist posted:Yeah, fair enough, and I agree with this as far as what to teach first. My issue is with making "if in doubt, use list comprehensions" your #1 maxim assuming you'll later teach "but in certain situations [when you have a numpy array] you may prefer a vectorized operation." SurgicalOntologist posted:Unless you're not going to be using numpy at all, then whatever. SurgicalOntologist posted:For me this makes me more likely to code it the vectorized way first, because it's easier to code and if it's a prototype I'm not worrying about "what if they pass an iterable." This seems logical to me: with longer-term maintained code for wider use, use the more general form. When you're just whipping something together, use the specific form if it's easier to write. And if it's too slow/long, there probably is a better way.
|
# ? Jan 11, 2016 17:26 |
|
Okay, fair enough, sounds reasonable. Sorry for splitting hairs.
|
# ? Jan 11, 2016 17:28 |
|
SurgicalOntologist posted:Okay, fair enough, sounds reasonable. Sorry for splitting hairs.
|
# ? Jan 11, 2016 17:30 |
|
Suspicious Dish posted: those are lists, not arrays, jewel

In the context of general Python programming, I would interpret "array" as synonymous with "list", since lists are the closest match for the notion of array in the Python standard library. In the context of using Python with Numpy, there might be an argument that this no longer holds; but as discussed, Numpy was at best implicitly introduced to the discussion context.
|
# ? Jan 11, 2016 18:05 |
|
Hammerite posted: In the context of general Python programming, I would interpret "array" as synonymous with "list", since lists are the closest match for the notion of array in the Python standard library.

you forgot about https://docs.python.org/2/library/array.html as you should
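For what it's worth, the standard library's array type behaves like a list in the respect that started this exchange: + concatenates rather than adding element by element. A short sketch, with the typecode and values chosen arbitrarily:

```python
from array import array

a = array("i", [5, 10, 15])  # "i" is the typecode for signed int
b = array("i", [2, 4, 6])

# Same result as with lists: the two arrays are joined, not summed.
joined = a + b
```

So the element-wise behaviour discussed earlier is specific to NumPy arrays, not to anything called "array" in Python itself.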
|
# ? Jan 11, 2016 18:41 |