Wallet posted:I have very limited programming experience generally and even less experience with Python, so I'll apologize if this is a really stupid question, but I wasn't able to find much from googling:

Either approach should work just fine, but the first one is probably better. 90000 rows is really not that many.
# ? Jan 8, 2018 17:53 |
That's a good use case for pandas. Something like: Python code:
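The code block didn't survive the archive, but given the discussion (reading a ~90,000-row file and looking values up by key) it was presumably along these lines. The column names and data here are invented stand-ins:

```python
import pandas as pd
from io import StringIO

# Toy stand-in for the real 90,000-row file; "key"/"value" names are assumed
csv_file = StringIO("key,value\napple,3\nbanana,7\ncherry,1\n")

# read_csv parses with a C engine; index_col makes lookups by key fast
df = pd.read_csv(csv_file, index_col='key')

# Single lookup by key
one = df.loc['banana', 'value']

# Several keys at once
several = df.loc[['apple', 'cherry'], 'value']
```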
# ? Jan 8, 2018 18:12 |
SurgicalOntologist posted:That's a good use case for pandas. Something like
# ? Jan 8, 2018 18:17 |
Yeah, that was going to be my question. Isn't the benefit of that just that the functionality is built-in and easy to use rather than that it's efficient?
# ? Jan 8, 2018 18:18 |
Only way to tell is to test. In this case I would guess pandas is faster: read_csv uses a C implementation, so if parsing is the slow part there should be an improvement over the builtin csv module. The indexing itself should be fairly well optimized too, and if speed is an issue, sorting the frame or the lookup keys might help. Generally the "pandas gives you a slowdown" complaint is about comparisons to numpy, not pure Python.
# ? Jan 8, 2018 18:22 |
Indeed; the comparisons I've done have mostly been numpy -> pandas. IIRC loc itself can be orders of magnitude slower than numpy indexing.
# ? Jan 8, 2018 18:35 |
Yeah, the indexing speed considerations would depend on the type of keys. If they're consecutive ints this would be doable (and probably fastest) in numpy, but with arbitrary keys it's either pandas or a dict, and pandas might be faster in some cases and shouldn't be noticeably slower in others.
# ? Jan 8, 2018 18:58 |
Thanks for the responses, guys. I may have to try pandas, but I'll probably stick with the current implementation given that it's apparently reasonable.
# ? Jan 8, 2018 20:07 |
If you expect to do more data analysis with Python in the future, the recommendation is probably to just learn pandas right now.
# ? Jan 8, 2018 20:38 |
Speaking of which, the most popular Udemy course for Pandas is on sale for $11. I've just started using pandas/numpy at work for some data analysis, and there's a lot to learn once you get beyond the basics. https://www.udemy.com/data-analysis-with-pandas/ I haven't actually done the course yet, but I guess you can't go wrong with the one everyone else is flocking to.

Also, regarding pandas slowdowns: using a library that saves minutes or hours of dev time always wins for me over a faster implementation. Except, of course, if it's going into production and actually can't keep up with the workload. Then you optimize.
# ? Jan 9, 2018 05:27 |
Wallet posted:I have very limited programming experience generally and even less experience with Python, so I'll apologize if this is a really stupid question, but I wasn't able to find much from googling:

The best approach may depend on what you want to do with the keys and values. Do you want to iterate over every key/value pair? code:
code:
code:
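The code blocks in this post were lost; hedged guesses at the two approaches being contrasted (a dict for repeated random lookups vs. a single pass over every pair), with invented column names and data:

```python
import csv
from io import StringIO

# Toy stand-in for the real file
f = StringIO("key,value\napple,3\nbanana,7\n")

# If you need repeated random lookups, load everything into a dict once
lookup = {row['key']: row['value'] for row in csv.DictReader(f)}

# If you only need one pass over every key/value pair, just iterate the rows
f.seek(0)
pairs = [(row['key'], row['value']) for row in csv.DictReader(f)]
```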
QuarkJets fucked around with this message at 06:10 on Jan 9, 2018 |
# ? Jan 9, 2018 06:08 |
QuarkJets posted:The best approach may depend on what you want to do with the keys and values. Do you want to iterate over every key/value pair?

This is a good question; I realized I was only considering the function that finds the values for a given list of keys, when the reason it's called is probably important. It turns out the function was only ever being called to figure out which key in a list of keys had the highest numerical value in a given column of the file. So now I'm just creating a list of keys sorted by the relevant value and iterating through it until any of the desired keys is found, at which point the rest don't matter. No one will be surprised to learn that this is faster.
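A sketch of the approach described (all the names here are invented; the real key and column names would differ):

```python
# Toy rows standing in for the file's contents
rows = [{'key': 'a', 'score': 5}, {'key': 'b', 'score': 9}, {'key': 'c', 'score': 2}]
wanted = {'a', 'c'}

# Sort once by the relevant column, descending, then stop at the first hit:
# everything after the first wanted key doesn't matter
ordered = sorted(rows, key=lambda r: r['score'], reverse=True)
best = next(r['key'] for r in ordered if r['key'] in wanted)
```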
# ? Jan 9, 2018 17:26 |
A big thank you to the goons who had map suggestions. I've been trying out folium, but I'm trying to find a way to get all the values into one map. From what I can glean from the documentation, folium uses a format like this: code:

1) Generate a sufficient number of lines with the content: folium.Marker([x, y]).add_to(map_1)
2) Fill in x and y with the lat/long values from the spreadsheet

I'm not sure how to do this. I've been able to read the data from the spreadsheet: code:
# ? Jan 10, 2018 03:13 |
Should I look into selenium if I want to scrape data off a web page that's intentionally obfuscated?
# ? Jan 10, 2018 12:36 |
Seventh Arrow posted:each line needs to take the value from the next row in the spreadsheet.

I don't understand what this means.
# ? Jan 10, 2018 13:23 |
unpacked robinhood posted:Should I look into selenium if I want to scrape data off a web page that's intentionally obfuscated ? Obfuscated how? The only thing Selenium will give you over, say, Scrapy is JS execution. That could be a big deal depending on what any given page is doing, but if it's just intentionally confusing, difficult to interpret or badly laid out in a rendered document, you won't see any benefit.
# ? Jan 10, 2018 14:33 |
Cingulate posted:I don't understand what this means. What I'm trying to say is that each folium line can't keep reading the first row over and over again. The first folium line needs to use row 1, the second one needs to use row 2, and so on.
# ? Jan 10, 2018 15:17 |
Seventh Arrow posted:What I'm trying to say is that each folium line can't keep reading the first row over and over again. The first folium line needs to use row 1, the second one needs to use row 2, and so on. code:
Sorry if I'm totally missing your point.
# ? Jan 10, 2018 15:50 |
If you need to pull the value of the next row into the current row, then create a new column with shift(-1)?
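For reference, shift(-1) pulls each column value up one row, so each row sees the next row's value (tiny sketch with made-up data; the last row ends up NaN):

```python
import pandas as pd

df = pd.DataFrame({'Lat': [1.0, 2.0, 3.0]})
# Each row gets the Lat of the row below it; the final row has no "next"
df['next_lat'] = df['Lat'].shift(-1)
```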
# ? Jan 10, 2018 15:55 |
Seventh Arrow posted:What I'm trying to say is that each folium line can't keep reading the first row over and over again. The first folium line needs to use row 1, the second one needs to use row 2, and so on. If your dataframe has 'Lat', 'Long' and 'Description' columns, then I think this is what you might be looking for: Python code:
# ? Jan 10, 2018 16:01 |
vikingstrike posted:If you need to pull the value of the next row into the current row, then create a new column with shift(-1)? code:
Jose Cuervo posted:If your dataframe has 'Lat', 'Long' and 'Description' columns, then I think this is what you might be looking for: Python code:
# ? Jan 10, 2018 16:01 |
Munkeymon posted:Obfuscated how? The only thing Selenium will give you over, say, Scrapy is JS execution. That could be a big deal depending on what any given page is doing, but if it's just intentionally confusing, difficult to interpret or badly laid out in a rendered document, you won't see any benefit.

Obfuscated is inaccurate; I'd say it's not meant to be parsable, at least. I'd like to read toll prices from an online calculator. The values I'm interested in are displayed but don't appear in the source. Using the Inspector I've found them in a JSON response to a request that takes an authkey and a bunch of other values as parameters. Copy-pasting the request URL into a "fresh" browser (no cookies etc.) seems to be enough to get an answer, but removing the authkey value gives an "Access denied" message, as does trying to get the file with curl, for example. I haven't tried making the request with a made-up User-Agent, but I'm concerned the authkey value may expire. If Selenium could load the page, fill the form, catch the URL of the request I'm interested in, and load the JSON somewhere clean, that would be nice.

unpacked robinhood fucked around with this message at 16:17 on Jan 10, 2018 |
# ? Jan 10, 2018 16:13 |
Sounds like it's just an SPA where the data you want is filled in via ajax calls. So your test framework needs to run JS with full browser-like capabilities. (Or access the API directly, but...)
# ? Jan 10, 2018 16:16 |
Cingulate posted:Although ideally, you'd vectorise that.

Sorry for the confusion. Maybe I can make it clearer: I need these lines "folium.Marker([x, y]..." populating the python script so they can put markers on the folium map. Except there's thousands of rows in the latitude/longitude csv, so I'm not going to write each folium line by hand. So instead I need a way to get python to generate a bunch of "folium.Marker([x, y]..." lines, but also fill in the latitude/longitude information. Is that a bit better?

In the meantime, I'll take a look at your and Jose Cuervo's suggestions - thanks!

edit: of course, loading that much data into folium at once is another issue, but one thing at a time...

Seventh Arrow fucked around with this message at 16:50 on Jan 10, 2018 |
# ? Jan 10, 2018 16:47 |
Jose Cuervo posted:If your dataframe has 'Lat', 'Long' and 'Description' columns, then I think this is what you might be looking for:
# ? Jan 10, 2018 16:50 |
Fair enough - but how would you vectorize what Seventh Arrow wants to do? Edit: Or are you saying use something like: Python code:
Jose Cuervo fucked around with this message at 17:45 on Jan 10, 2018 |
# ? Jan 10, 2018 17:41 |
unpacked robinhood posted:Obfuscated is inaccurate. I'd say it's not meant to be parsable at least.

In these scenarios I often find it's not that hard to mimic the underlying API calls rather than use Selenium. Often you can find additional internal data as well. You seem to have started in that direction; you just need to figure out how to get the authkey. To do that, just figure out what API calls are made when you log in, then use a requests.Session to maintain your cookies/headers. Likely you won't even have to find the authkey manually; it will be stored on the session automatically. For example: Python code:
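The example code is missing; here's a sketch of the pattern with made-up URLs and field names (the real endpoints and parameter names would come from the browser dev tools' network tab):

```python
import requests

# A Session persists cookies and headers across requests
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # in case the site checks

def fetch_tolls(origin, destination):
    # Hypothetical endpoints/parameters -- replace with what the network
    # tab shows the real page doing
    session.post('https://example.com/login',
                 data={'user': 'me', 'password': 'secret'})
    resp = session.get('https://example.com/api/tolls',
                       params={'from': origin, 'to': destination})
    resp.raise_for_status()
    return resp.json()  # auth cookies set at login persist on the session
```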
SurgicalOntologist fucked around with this message at 17:46 on Jan 10, 2018 |
# ? Jan 10, 2018 17:44 |
Seventh Arrow posted:Sorry for the confusion. Maybe I can make it clearer: I need these lines "folium.Marker([x, y]..." populating the python script so they can put markers on the folium map. Except there's thousands of rows in the latitude/longitude csv, so I'm not going to write each folium line by hand. Python code:
(Can't vectorise if folium.Marker doesn't take array input.) Seventh Arrow, what's throwing me off is you keep writing you want to "generate lines". But what you do want is to have Python go through the data and use the values, not literally create these lines of code, right?
# ? Jan 10, 2018 18:13 |
Jose Cuervo posted:Fair enough - but how would you vectorize what Seventh Arrow wants to do? He could keep the data in DF form for most uses, and use the array for iterating and/or mass-indexing. Python code:
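Dominoes' snippet is gone; presumably the idea was something like pulling a plain numpy array out of the dataframe for the hot loop while keeping the DataFrame for everything else (toy data):

```python
import pandas as pd

df = pd.DataFrame({'Lat': [1.0, 2.0], 'Long': [3.0, 4.0]})

# .values gives a plain numpy array, which is cheaper to iterate than iterrows
coords = df[['Lat', 'Long']].values
pairs = [(lat, long_) for lat, long_ in coords]
```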
Dominoes fucked around with this message at 18:19 on Jan 10, 2018 |
# ? Jan 10, 2018 18:13 |
If you're looping and plotting I don't think it's going to matter that much how you loop. The plotting is likely many orders of magnitude slower than the looping. In any case it sounds like the issue isn't speed but just understanding of looping. SurgicalOntologist fucked around with this message at 18:25 on Jan 10, 2018 |
# ? Jan 10, 2018 18:23 |
|
Cingulate posted:Seventh Arrow, what's throwing me off is you keep writing you want to "generate lines". But what you do want is to have Python go through the data and use the values, not literally create these lines of code, right? Yes, I think so. Maybe a better way to put it is that I want folium to put a marker on the map for every lat/long coordinate in the csv. Whatever python voodoo it takes to do that is irrelevant to me (unless it actually involves sacrificing chickens on an altar).
|
# ? Jan 10, 2018 18:30 |
|
itertuples() is much quicker than iterrows(), and might be a nice middle ground.
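For comparison, itertuples yields lightweight namedtuples instead of the Series objects iterrows builds, which is where the speedup comes from (toy data):

```python
import pandas as pd

df = pd.DataFrame({'Lat': [1.0, 2.0], 'Long': [3.0, 4.0]})

# Each row is a namedtuple; fields are accessed as attributes
coords = [(row.Lat, row.Long) for row in df.itertuples(index=False)]
```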
# ? Jan 10, 2018 18:30 |
Seventh Arrow posted:Yes, I think so. Maybe a better way to put it is that I want folium to put a marker on the map for every lat/long coordinate in the csv. Whatever python voodoo it takes to do that is irrelevant to me (unless it actually involves sacrificing chickens on an altar). I hope it's halfway intuitive what's going on - the code:
code:
# ? Jan 10, 2018 18:38 |
Ok, thank you. What happened to the "shift(-1)"? Is that no longer necessary?
# ? Jan 10, 2018 18:43 |
Cingulate posted:If the thing you need to do is indeed to go through the data line by line, and for each line, run the Marker thing on that line's values, then you could indeed do what I'm suggesting here: I don't know that the code you have works. Running the following (I think equivalent) code results in 'ValueError: need more than 1 value to unpack' Python code:
# ? Jan 10, 2018 19:16 |
Seventh Arrow posted:Ok, thank you. What happened to the "shift(-1)"? Is that no longer necessary?
# ? Jan 10, 2018 19:16 |
Jose Cuervo posted:I don't know that the code you have works. Running the following (I think equivalent) code results in 'ValueError: need more than 1 value to unpack' Python code:
code:
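The fix being pointed at is that iterrows yields (index, row) pairs, so both parts have to be unpacked; a minimal illustration with toy data:

```python
import pandas as pd

df = pd.DataFrame({'Lat': [1.0], 'Long': [2.0]})

# "for row in df.iterrows(): lat, long = ..." fails because each item is an
# (index, Series) tuple; unpack the index and the row separately:
for idx, row in df.iterrows():
    lat, long_ = row['Lat'], row['Long']
```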
# ? Jan 10, 2018 19:18 |
SurgicalOntologist posted:In these scenarios I often find it's not that hard to mimic the underlying API calls rather than use Selenium. Often you can find additional internal data as well. Thanks a lot. I'll probably need this later. So far my thing manages to get the data fine but there's no way the authkey doesn't expire after a while. It seems to be the only requirement to get a valid answer. I've only added a random user agent but I'm not even sure it's necessary.
# ? Jan 10, 2018 22:23 |
unpacked robinhood posted:Obfuscated is inaccurate. I'd say it's not meant to be parsable at least.

There are roughly three possibilities here:

1) They might just have that one hard-coded auth key, in which case you keep using it, and it's fine until they change something that'd probably break your scraper anyway.
2) The key might be loaded into some DOM node and read by a script, so you can probably do the same given that static page and a little legwork. Open the Chrome dev tools, search all with Ctrl[Option] + Shift[Command] + F, and search for the key to see if it's just getting squirted out into a script element or something (this is likely).
3) It's derived from something the server sends. You're probably better off using Selenium in this case, because they're being clever and paranoid.
# ? Jan 11, 2018 15:24 |
Munkeymon posted:There are roughly three possibilities here:

Thanks. So far they haven't changed the key since yesterday, but I'm still saving this for later when they eventually do.

Another question: I have a list of Step objects, and another class that goes online to retrieve information for each object (as in, a new request is done for each of them). I'd like to run it in a thread so I don't have to wait on the request to do operations that don't require it. Should I pass a setter method from my Step object as a callback to the object doing the request, and have the values added as they come? I have something like this in mind: Python code:
e: just to be clear the Request class is a made up name, not me using an existing module. unpacked robinhood fucked around with this message at 20:39 on Jan 11, 2018 |
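The sketched code is gone; a runnable version of the pattern described, with a local function standing in for the real network request (Step, worker, and fetch are all invented names, per the post's note that Request was made up too):

```python
import threading

class Step:
    def __init__(self, name):
        self.name = name
        self.info = None

    def set_info(self, value):      # the setter passed in as a callback
        self.info = value

def fetch(name):
    return f"data for {name}"       # stand-in for the real HTTP request

def worker(step, callback):
    # Runs in its own thread; hands the fetched value to the callback
    callback(fetch(step.name))

steps = [Step('a'), Step('b')]
threads = [threading.Thread(target=worker, args=(s, s.set_info)) for s in steps]
for t in threads:
    t.start()                       # main thread is free to do other work here
for t in threads:
    t.join()                        # values were added as they came in
```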
# ? Jan 11, 2018 15:41 |