Eela6
May 25, 2007
Shredded Hen

Wallet posted:

I have very limited programming experience generally and even less experience with Python, so I'll apologize if this is a really stupid question, but I wasn't able to find much from googling:

I've got a csv file with a little over 90,000 rows that each have a key in the first column and a value in the second. I also have a list of keys that I want to retrieve the values for.

Currently, I'm using csv.reader to read the file into a dictionary and then looping through my list of keys to retrieve the value for each from the dictionary. This works, but I have a feeling that this is a really stupid/inefficient way of going about things.

The other approach that comes to mind is creating a duplicate of the list of keys that I want to retrieve values for, iterating through the rows of the file checking if that row matches any of the keys I'm after, storing the value and removing the key from my duplicate list if it does match, and continuing on until the duplicate list is empty.

Am I an idiot? Is either of these approaches appropriate? Is there a better solution?

Either approach should work just fine, but the first one is probably better. 90000 rows is really not that many.
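For reference, the first approach is just something like this (untested sketch; the filename and the key list are made up, and it assumes the key is in column one and the value in column two):

Python code:
import csv

# Build a {key: value} dict from the first two columns of the file,
# then look up only the keys you care about.
with open('data.csv', newline='') as f:
    lookup = {row[0]: row[1] for row in csv.reader(f)}

values = [lookup[k] for k in keys_to_retrieve if k in lookup]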

SurgicalOntologist
Jun 17, 2004

That's a good use case for pandas. Something like

Python code:
pd.read_csv(filename, index_col=0).loc[keys_to_retrieve]
Edit: to elaborate, in pandas a Series/DataFrame is basically a dictionary that allows vectorized lookups.

Dominoes
Sep 20, 2007

SurgicalOntologist posted:

That's a good use case for pandas. Something like

Python code:
pd.read_csv(filename, index_col=0).loc[keys_to_retrieve]
Edit: to elaborate, in pandas a Series/DataFrame is basically a dictionary that allows vectorized lookups.
How does this compare to builtins speed-wise? PD seems to be a minefield of slowdowns if used improperly.

Data Graham
Dec 28, 2009

📈📊🍪😋



Yeah, that was going to be my question. Isn't the benefit of that just that the functionality is built-in and easy to use rather than that it's efficient?

SurgicalOntologist
Jun 17, 2004

Only way to tell is to test. In this case I would guess it's faster. read_csv will use a C implementation, so if that part is slow there should be an improvement over the builtin csv module. The indexing itself should be fairly well optimized as well, and if speed is an issue sorting the frame or the lookup keys might help.

Generally the "pandas gives you a slowdown" is in comparison to numpy, not pure python.

Dominoes
Sep 20, 2007

Indeed; the comparisons I've done have mostly been numpy -> pandas. IIRC .loc itself can be orders of magnitude slower than numpy indexing.

SurgicalOntologist
Jun 17, 2004

Yeah, the indexing speed considerations would depend on the type of keys. If they are consecutive ints this would be doable (and probably fastest) in numpy, but with arbitrary keys it's either pandas or a dict, and pandas might be faster in some cases and shouldn't be noticeably slower in others.
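For the consecutive-int case, the numpy version would be roughly this (hypothetical file, numeric second column):

Python code:
import numpy as np

# Only works when the keys are row positions 0..N-1
values = np.loadtxt('data.csv', delimiter=',', usecols=1)
wanted = values[np.asarray(keys_to_retrieve, dtype=int)]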

Wallet
Jun 19, 2006

Thanks for the responses, guys. I may have to try pandas, but I'll probably stick with the current implementation given that it's apparently reasonable.

Cingulate
Oct 23, 2012

by Fluffdaddy
If you expect to do more data analysis with Python in the future, the recommendation is probably to just learn pandas right now.

porksmash
Sep 30, 2008
Speaking of which, the most popular Udemy course for Pandas is on sale for $11. I've just started using pandas/numpy at work for some data analysis, and there's a lot to learn once you get beyond the basics. https://www.udemy.com/data-analysis-with-pandas/ I haven't actually done the course yet, but I figure you can't go wrong with the one everyone else is flocking to.

Also, regarding pandas slowdowns, using a library that saves minutes or hours of dev time always wins for me over a faster implementation. Except, of course, if it's going into production and actually can't keep up with the workload. Then you optimize.

QuarkJets
Sep 8, 2008

Wallet posted:

I have very limited programming experience generally and even less experience with Python, so I'll apologize if this is a really stupid question, but I wasn't able to find much from googling:

I've got a csv file with a little over 90,000 rows that each have a key in the first column and a value in the second. I also have a list of keys that I want to retrieve the values for.

Currently, I'm using csv.reader to read the file into a dictionary and then looping through my list of keys to retrieve the value for each from the dictionary. This works, but I have a feeling that this is a really stupid/inefficient way of going about things.

The other approach that comes to mind is creating a duplicate of the list of keys that I want to retrieve values for, iterating through the rows of the file checking if that row matches any of the keys I'm after, storing the value and removing the key from my duplicate list if it does match, and continuing on until the duplicate list is empty.

Am I an idiot? Is either of these approaches appropriate? Is there a better solution?

The best approach may depend on what you want to do with the keys and values. Do you want to iterate over every key/value pair?

code:
for key, value in csv_dict.items()"
    do_something(key, value)
Do you only want to access specific keys?

code:
for key in interesting_key_set:
    if key in csv_dict:
        do_something(key, csv_dict[key])
Are you only interested in using the dictionary to filter out duplicate keys, and then you're only really interested in doing something with all of the values?

code:
for value in csv_dict.values():
    do_something(value)

QuarkJets fucked around with this message at 06:10 on Jan 9, 2018

Wallet
Jun 19, 2006

QuarkJets posted:

The best approach may depend on what you want to do with the keys and values. Do you want to iterate over every key/value pair?

This is a good question, as I realized that I was only really considering the function that was finding the values for a given list of keys while the reason was probably important. It turns out that the only reason the function was ever being called was to figure out which key in a list of keys had the highest numerical value in a given column in the file, so now I'm just creating a list of keys sorted by the relevant value and then iterating through that list until any of the desired keys is found, at which point the rest don't matter.

No one will be surprised to learn that this is faster.
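(For reference, the pattern is roughly this, with made-up names:)

Python code:
# csv_dict maps key -> numeric value; desired_keys is the list I'm searching for
sorted_keys = sorted(csv_dict, key=csv_dict.get, reverse=True)
wanted = set(desired_keys)
best = next((k for k in sorted_keys if k in wanted), None)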

Seventh Arrow
Jan 26, 2005

A big thank you to the goons who had map suggestions. I've been trying out folium, but I'm trying to find a way to get all the values into one map. From what I can glean from the documentation, folium uses a format like this:

code:
map_1 = folium.Map(location=[45.372, -121.6972],
                  zoom_start=12,
                  tiles='Stamen Terrain')
folium.Marker([45.3288, -121.6625], popup='Mt. Hood Meadows').add_to(map_1)
folium.Marker([45.3311, -121.7113], popup='Timberline Lodge').add_to(map_1)
map_1
As mentioned, I have a csv file with the latitude and longitude and it actually even has a field with both values in the one cell. So as far as I can tell I need to do two things:

1) Generate a sufficient number of lines with the content: folium.Marker([x, y]).add_to(map_1)

2) Fill in x and y with the lat/long values from the spreadsheet

I'm not sure how to do this. I've been able to read the data from the spreadsheet:

code:
import pandas as pd
import folium

df_raw = pd.read_excel('df_condo_v9_t1.xlsx', sheetname=0, header=0)

df_raw.shape

df_raw.dtypes

df_lat = df_raw['Latlng']

df_lat.head()
But I'm not really sure what to do next. I think that the folium lines can be formatted "folium.Marker([df_lat]).add_to(map_1)" but even that's not so straightforward because each line needs to take the value from the next row in the spreadsheet. Any suggestions would be appreciated.

unpacked robinhood
Feb 18, 2013

by Fluffdaddy
Should I look into selenium if I want to scrape data off a web page that's intentionally obfuscated?

Cingulate
Oct 23, 2012

by Fluffdaddy

Seventh Arrow posted:

each line needs to take the value from the next row in the spreadsheet.
I don't understand what this means.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



unpacked robinhood posted:

Should I look into selenium if I want to scrape data off a web page that's intentionally obfuscated?

Obfuscated how? The only thing Selenium will give you over, say, Scrapy is JS execution. That could be a big deal depending on what any given page is doing, but if it's just intentionally confusing, difficult to interpret or badly laid out in a rendered document, you won't see any benefit.

Seventh Arrow
Jan 26, 2005

Cingulate posted:

I don't understand what this means.

What I'm trying to say is that each folium line can't keep reading the first row over and over again. The first folium line needs to use row 1, the second one needs to use row 2, and so on.

Cingulate
Oct 23, 2012

by Fluffdaddy

Seventh Arrow posted:

What I'm trying to say is that each folium line can't keep reading the first row over and over again. The first folium line needs to use row 1, the second one needs to use row 2, and so on.
Maybe you can describe what you need to happen, conceptually. But in general, you can loop over the data frame. If you loop over a column (i.e., a pandas Series), it's usually equivalent to just looping over its contents. E.g.,

code:
for value in df_raw['Latlng']:
    folium.do_something(value)
Although ideally, you'd vectorise that.

Sorry if I'm totally missing your point.

vikingstrike
Sep 23, 2007

whats happening, captain
If you need to pull the value of the next row into the current row, then create a new column with shift(-1)?

Jose Cuervo
Aug 25, 2004

Seventh Arrow posted:

What I'm trying to say is that each folium line can't keep reading the first row over and over again. The first folium line needs to use row 1, the second one needs to use row 2, and so on.

If your dataframe has 'Lat', 'Long' and 'Description' columns, then I think this is what you might be looking for:

Python code:
for idx, row in df_raw.iterrows():
    folium.Marker([row['Lat'], row['Long']], popup=row['Description']).add_to(map_1)

Cingulate
Oct 23, 2012

by Fluffdaddy

vikingstrike posted:

If you need to pull the value of the next row into the current row, then create a new column with shift(-1)?
Yeah maybe what you need is something like

code:
for first_val, second_val in zip(df_raw['Latlng'], df_raw['Latlng'].shift(-1)):
    folium.do_something(first_val, second_val)

Jose Cuervo posted:

If your dataframe has 'Lat', 'Long' and 'Description' columns, then I think this is what you might be looking for:

Python code:
for idx, row in df_raw.iterrows():
    folium.Marker([row['Lat'], row['Long']], popup=row['Description']).add_to(map_1)
or
Python code:
for lat, long, description in df[["Lat", "Long", "Description"]]:
    folium.Marker([lat, long], popup=description).add_to(map_1)

unpacked robinhood
Feb 18, 2013

by Fluffdaddy

Munkeymon posted:

Obfuscated how? The only thing Selenium will give you over, say, Scrapy is JS execution. That could be a big deal depending on what any given page is doing, but if it's just intentionally confusing, difficult to interpret or badly laid out in a rendered document, you won't see any benefit.

Obfuscated is inaccurate. I'd say it's not meant to be parsable, at least.

I'd like to read toll prices from an online calculator. The values I'm interested in are displayed but don't appear in the page source.
Using the Inspector I've found them in a JSON response to a request that has an authkey and a bunch of other values as parameters.

Copy-pasting the request URL into a "fresh" browser (no cookies etc.) seems to be enough to get a response, but removing the authkey value gives an "Access denied" message, as does trying to fetch the file with curl, for example.

I haven't tried making a request with a made-up User-Agent, but I'm concerned the authkey value may expire.

If Selenium could load the page, fill in the form, catch the URL of the request I'm interested in, and load the JSON somewhere clean, that would be nice.

unpacked robinhood fucked around with this message at 16:17 on Jan 10, 2018

Data Graham
Dec 28, 2009

📈📊🍪😋



Sounds like it's just an SPA where the data you want is filled in via ajax calls. So your test framework needs to run JS with full browser-like capabilities.

(Or access the API directly, but...)

Seventh Arrow
Jan 26, 2005

Cingulate posted:

Although ideally, you'd vectorise that.

Sorry if I'm totally missing your point.

Sorry for the confusion. Maybe I can make it clearer: I need these lines "folium.Marker([x, y]..." populating the python script so they can put markers on the folium map. Except there's thousands of rows in the latitude/longitude csv, so I'm not going to write each folium line by hand.

So instead I need a way to get python to generate a bunch of "folium.Marker([x, y]..." lines, but also fill in the latitude/longitude information. Is that a bit better?

In the meantime, I'll take a look at your and Jose Cuervo's suggestions - thanks!


edit: of course, loading that much data into folium at once is another issue, but one thing at a time...

Seventh Arrow fucked around with this message at 16:50 on Jan 10, 2018

Dominoes
Sep 20, 2007

Jose Cuervo posted:

If your dataframe has 'Lat', 'Long' and 'Description' columns, then I think this is what you might be looking for:

Python code:
for idx, row in df_raw.iterrows():
    folium.Marker([row['Lat'], row['Long']], popup=row['Description']).add_to(map_1)
iterrows is very slow.

Jose Cuervo
Aug 25, 2004

Fair enough - but how would you vectorize what Seventh Arrow wants to do?

Edit: Or are you saying use something like:
Python code:
for lat, long, description in zip(df['Lat'], df['Long'], df['Description']):
    folium.Marker([lat, long], popup=description).add_to(map_1)

Jose Cuervo fucked around with this message at 17:45 on Jan 10, 2018

SurgicalOntologist
Jun 17, 2004

unpacked robinhood posted:

Obfuscated is inaccurate. I'd say it's not meant to be parsable, at least.

I'd like to read toll prices from an online calculator. The values I'm interested in are displayed but don't appear in the page source.
Using the Inspector I've found them in a JSON response to a request that has an authkey and a bunch of other values as parameters.

Copy-pasting the request URL into a "fresh" browser (no cookies etc.) seems to be enough to get a response, but removing the authkey value gives an "Access denied" message, as does trying to fetch the file with curl, for example.

I haven't tried making a request with a made-up User-Agent, but I'm concerned the authkey value may expire.

If Selenium could load the page, fill in the form, catch the URL of the request I'm interested in, and load the JSON somewhere clean, that would be nice.

In these scenarios I often find it's not that hard to mimic the underlying API calls rather than use Selenium. Often you can find additional internal data as well.

You seem to have started in that direction, you just need to figure out how to get the authkey. To do that just figure out what API calls are made when you log in. Then use a requests.Session to maintain your cookie/headers. Likely you don't even have to find the authkey manually, but it will be automatically stored in the headers. For example:
Python code:
with requests.Session() as session:
    session.headers.update({'User-Agent': '...'})  # may or may not be necessary

    response = session.post(login_url, data=dict(username='...', password='...'))  # use dev tools to find out what you need to post
    response.raise_for_status()

    response = session.get(data_url)
    response.raise_for_status()
    data = response.json()
If that doesn't work, the next thing to try is to GET the login page before POSTing to the login API. That's sometimes necessary in case the website expects your session to begin that way. In any case you usually don't need much more than the above lines and it is usually much more reliable and faster than selenium. I maintain a lot of scrapers and I've switched them all over to this method.

SurgicalOntologist fucked around with this message at 17:46 on Jan 10, 2018

Cingulate
Oct 23, 2012

by Fluffdaddy

Seventh Arrow posted:

Sorry for the confusion. Maybe I can make it clearer: I need these lines "folium.Marker([x, y]..." populating the python script so they can put markers on the folium map. Except there's thousands of rows in the latitude/longitude csv, so I'm not going to write each folium line by hand.

So instead I need a way to get python to generate a bunch of "folium.Marker([x, y]..." lines, but also fill in the latitude/longitude information. Is that a bit better?

In the meantime, I'll take a look at your and Jose Cuervo's suggestions - thanks!


edit: of course, loading that much data into folium at once is another issue, but one thing at a time...
If the thing you need to do is indeed to go through the data line by line, and for each line, run the Marker thing on that line's values, then you could indeed do what I'm suggesting here:

Python code:
for lat, long, description in df[["Lat", "Long", "Description"]]:
    folium.Marker([lat, long], popup=description).add_to(map_1)
Also goes to Jose Cuervo.
(Can't vectorise if folium.Marker doesn't take array input.)

Seventh Arrow, what's throwing me off is you keep writing you want to "generate lines". But what you do want is to have Python go through the data and use the values, not literally create these lines of code, right?

Dominoes
Sep 20, 2007

Jose Cuervo posted:

Fair enough - but how would you vectorize what Seventh Arrow wants to do?
I'm not sure - without an answer to this question, I'd convert to an array, then perform equivalent operations on it. Something like this should be several thousand times faster, depending on the data etc.

He could keep the data in DF form for most uses, and use the array for iterating and/or mass-indexing.

Python code:
data = df_raw.values

lat_index = 1
lon_index = 2
description_index = 3

for row in data:
    folium.Marker([row[lat_index], row[lon_index]], popup=row[description_index]).add_to(map_1)

Dominoes fucked around with this message at 18:19 on Jan 10, 2018

SurgicalOntologist
Jun 17, 2004

If you're looping and plotting I don't think it's going to matter that much how you loop. The plotting is likely many orders of magnitude slower than the looping.

In any case it sounds like the issue isn't speed but just understanding of looping.

SurgicalOntologist fucked around with this message at 18:25 on Jan 10, 2018

Seventh Arrow
Jan 26, 2005

Cingulate posted:

Seventh Arrow, what's throwing me off is you keep writing you want to "generate lines". But what you do want is to have Python go through the data and use the values, not literally create these lines of code, right?

Yes, I think so. Maybe a better way to put it is that I want folium to put a marker on the map for every lat/long coordinate in the csv. Whatever python voodoo it takes to do that is irrelevant to me (unless it actually involves sacrificing chickens on an altar).

vikingstrike
Sep 23, 2007

whats happening, captain
itertuples() is much quicker than iterrows(), and might be a nice middle ground.
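E.g., assuming the same column names as the examples above, something like:

Python code:
for row in df_raw[['Lat', 'Long', 'Description']].itertuples(index=False):
    folium.Marker([row.Lat, row.Long], popup=row.Description).add_to(map_1)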

Cingulate
Oct 23, 2012

by Fluffdaddy

Seventh Arrow posted:

Yes, I think so. Maybe a better way to put it is that I want folium to put a marker on the map for every lat/long coordinate in the csv. Whatever python voodoo it takes to do that is irrelevant to me (unless it actually involves sacrificing chickens on an altar).
Yeah then out of the solutions suggested so far, I think mine is the best.

I hope it's halfway intuitive what's going on - the
code:
df[["Lat", "Long", "Description"]]
part is extracting just these 3 columns of the data frame in just this order (if there are no other columns, it's pointless, then you just need to check the order), and the
code:
for lat, long, description in ...
part goes through the data row by row, calling the first column's value lat, the second's long, etc., and passes them on to the body of the loop, where you can run your Folium function.

Seventh Arrow
Jan 26, 2005

Ok, thank you. What happened to the "shift(-1)"? Is that no longer necessary?

Jose Cuervo
Aug 25, 2004

Cingulate posted:

If the thing you need to do is indeed to go through the data line by line, and for each line, run the Marker thing on that line's values, then you could indeed do what I'm suggesting here:

Python code:
for lat, long, description in df[["Lat", "Long", "Description"]]:
    folium.Marker([lat, long], popup=description).add_to(map_1)
Also goes to Jose Cuervo.
(Can't vectorise if folium.Marker doesn't take array input.)

Seventh Arrow, what's throwing me off is you keep writing you want to "generate lines". But what you do want is to have Python go through the data and use the values, not literally create these lines of code, right?

I don't know that the code you have works. Running the following (I think equivalent) code results in 'ValueError: need more than 1 value to unpack'
Python code:
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5], 'b': [2,3,4,5,6], 'c': [3,4,5,6,7]})

for i, j, k in df[['a', 'b', 'c']]:
	print i, j, k

Cingulate
Oct 23, 2012

by Fluffdaddy

Seventh Arrow posted:

Ok, thank you. What happened to the "shift(-1)"? Is that no longer necessary?
That came out of a misunderstanding of what you wanted. It assumed that you wanted, in each iteration of the loop, the nth and the n+1th item.

Cingulate
Oct 23, 2012

by Fluffdaddy

Jose Cuervo posted:

I don't know that the code you have works. Running the following (I think equivalent) code results in 'ValueError: need more than 1 value to unpack'
Python code:
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5], 'b': [2,3,4,5,6], 'c': [3,4,5,6,7]})

for i, j, k in df[['a', 'b', 'c']]:
	print i, j, k
Ah yes, sorry. Make it

Python code:
for lat, long, description in df[["Lat", "Long", "Description"]].values:
    folium.Marker([lat, long], popup=description).add_to(map_1)
Without the
code:
values
, you're just iterating over the column labels.
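A quick illustration with the toy frame from above:

code:
list(df[['a', 'b', 'c']])         # ['a', 'b', 'c']  (column labels)
list(df[['a', 'b', 'c']].values)  # [array([1, 2, 3]), array([2, 3, 4]), ...]  (rows)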

unpacked robinhood
Feb 18, 2013

by Fluffdaddy

SurgicalOntologist posted:

In these scenarios I often find it's not that hard to mimic the underlying API calls rather than use Selenium. Often you can find additional internal data as well.

You seem to have started in that direction, you just need to figure out how to get the authkey. To do that just figure out what API calls are made when you log in. Then use a requests.Session to maintain your cookie/headers. Likely you don't even have to find the authkey manually, but it will be automatically stored in the headers. For example:
Python code:
with requests.Session() as session:
    session.headers.update({'User-Agent': '...'})  # may or may not be necessary

    response = session.post(login_url, data=dict(username='...', password='...'))  # use dev tools to find out what you need to post
    response.raise_for_status()

    response = session.get(data_url)
    response.raise_for_status()
    data = response.json()
If that doesn't work, the next thing to try is to GET the login page before POSTing to the login API. That's sometimes necessary in case the website expects your session to begin that way. In any case you usually don't need much more than the above lines and it is usually much more reliable and faster than selenium. I maintain a lot of scrapers and I've switched them all over to this method.

Thanks a lot. I'll probably need this later.
So far my thing manages to get the data fine but there's no way the authkey doesn't expire after a while. It seems to be the only requirement to get a valid answer. I've only added a random user agent but I'm not even sure it's necessary.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



unpacked robinhood posted:

Obfuscated is inaccurate. I'd say it's not meant to be parsable, at least.

I'd like to read toll prices from an online calculator. The values I'm interested in are displayed but don't appear in the page source.
Using the Inspector I've found them in a JSON response to a request that has an authkey and a bunch of other values as parameters.

Copy-pasting the request URL into a "fresh" browser (no cookies etc.) seems to be enough to get a response, but removing the authkey value gives an "Access denied" message, as does trying to fetch the file with curl, for example.

I haven't tried making a request with a made-up User-Agent, but I'm concerned the authkey value may expire.

If Selenium could load the page, fill in the form, catch the URL of the request I'm interested in, and load the JSON somewhere clean, that would be nice.

There are roughly three possibilities here:

They might just have that one hard-coded auth key, in which case you just keep using it and it's fine until they change something that'd probably break your scraper anyway.

The key might be loaded into some DOM node and read by a script, so you can probably do the same given the static page and a little legwork. Open the Chrome dev tools, search all with Ctrl[Option] + Shift[Command] + F for the key, and see if it's just getting squirted out into a script element or something (this is likely).

It's derived from something the server sends. You're probably better off using Selenium in this case because they're being clever and paranoid.

unpacked robinhood
Feb 18, 2013

by Fluffdaddy

Munkeymon posted:

There are roughly three possibilities here:

They might just have that one hard-coded auth key, in which case you just keep using it and it's fine until they change something that'd probably break your scraper anyway.

The key might be loaded into some DOM node and read by a script, so you can probably do the same given the static page and a little legwork. Open the Chrome dev tools, search all with Ctrl[Option] + Shift[Command] + F for the key, and see if it's just getting squirted out into a script element or something (this is likely).

It's derived from something the server sends. You're probably better off using Selenium in this case because they're being clever and paranoid.

Thanks. So far they haven't changed the key since yesterday but I'm still saving this for later when they eventually do.

Another question. I have a list of Step objects, and another class that goes online to retrieve information for each object (as in, a new request is made for each of them).
I'd like to run that in a thread so I don't have to wait on the request to do operations that don't require it.
Should I pass a setter method from my Step object as a callback to the object doing the request, and have the values added as they come?
I have something like this in mind:

Python code:

from threading import Thread

class Step:
    def __init__(self):
        self._some_value = 0

    def set_some_value(self, value):
        self._some_value = value

class Request(Thread):
    def request(self, callback=None):
        # doing things
        callback('some_message')

for s in step_list:
    Request().request(s.set_some_value)
As long as the setter method is only called by the Request class, I should be OK?
e: just to be clear the Request class is a made up name, not me using an existing module.

unpacked robinhood fucked around with this message at 20:39 on Jan 11, 2018
