Cingulate
Oct 23, 2012

by Fluffdaddy

CarForumPoster posted:

I just started yesterday so right now I nab everything and dump it into a CSV.

Also, is there a machine learning thread? I'm wondering if I could actually skip this step altogether.

I have the problem that similar data (names, dates, events) are stored in tables on lots of webpages, in semi-different formats on each website. I want to scrape that data and have ~machine learning~ (or whatever) sort it out for me such that it is stored in a way I can add to a database (CSV or something like that). I have ~50 sample webpages I could format how I'd like (desired output) and use to train a model, and could easily get a few hundred if it was worthwhile.

I know I'd need a way to vectorize either the table info or the HTML altogether... haven't figured that out yet.

Machine learning thread?
You can try the data science thread, or the stats thread.

Though I'm not sure what you actually want. (Supervised) machine learning relates standardised data of one form (predictors) to standardised data of another form (outcomes). It seems to me you're still describing part of the data wrangling process - although that too might belong in the data science thread.


Cingulate
Oct 23, 2012

by Fluffdaddy
Danyull, I'm not perfectly sure I get what you want, but

code:
for idx, pic in enumerate(pix):
    if not idx % rule.len:
        chunks.append(Chunk())
    chunks[-1].ch.append(pic)
Something like this?

Cingulate
Oct 23, 2012

by Fluffdaddy

Danyull posted:

You're right, I should have used the length there instead of one, however the infinite loop problem still existed. I ended up fixing it actually, but I still don't know how. For some reason when it got to the point of creating a new chunk, the new chunk.ch would already be filled with the elements of pix. Appending the elements of pix onto the end of chunk.ch was for some reason also appending the elements onto pix again, causing it to double in size each time. The issue went away after adding in "chunk.ch = []" between "chunk = Chunk()" and the appending loop.
So
code:
for idx, pic in enumerate(pix):
    if not idx % rule.len:
        chunks.append(Chunk())
        chunks[-1].ch = list()
    chunks[-1].ch.append(pic)
?
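If Chunk defines ch at class level, every instance shares the same list, which would also explain your doubling problem. A sketch with ch as an instance attribute instead (the Chunk class here is my guess at what yours looks like):

```python
class Chunk:
    def __init__(self):
        # instance attribute: each Chunk gets its own fresh list,
        # so no "chunk.ch = []" workaround is needed
        self.ch = []

def chunked(pix, size):
    chunks = []
    for idx, pic in enumerate(pix):
        if not idx % size:
            chunks.append(Chunk())
        chunks[-1].ch.append(pic)
    return chunks

chunks = chunked(range(10), 3)
print([len(c.ch) for c in chunks])  # → [3, 3, 3, 1]
```

Had ch been a class attribute (`ch = []` in the class body), every `chunks[-1].ch.append(...)` would have appended to one shared list.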

Cingulate
Oct 23, 2012

by Fluffdaddy
"while" solutions usually feel rather un-pythonic to me. I think if you want to iterate over an iterable, you should iterate over the iterable!

Counting indices manually (or whatever that's called) feels alien and un-pythonic.

Cingulate
Oct 23, 2012

by Fluffdaddy
I found this very interesting: https://medium.com/dunder-data/python-for-data-analysis-a-critical-line-by-line-review-5d5678a4c203
A brutal review of Wes McKinney's book on pandas.

Cingulate
Oct 23, 2012

by Fluffdaddy

Dominoes posted:

Is this summary of Pandas accurate? It's how I look at it/assess when to use it:

Wrapper for 2-D and 1-D arrays that includes labels, non-numerical indexing, different syntax, and many R-style statistical methods and filters. Orders-of-magnitude slower than the wrapped array, but if a problem, can convert to an array, perform bottlenecked-calcs, then convert back to a DF/Series.
"The Python default for dealing with tabular data."

Or, for people who know R: "Dataframes for Python."

Cingulate
Oct 23, 2012

by Fluffdaddy
So in light of

Cingulate posted:

I found this very interesting: https://medium.com/dunder-data/python-for-data-analysis-a-critical-line-by-line-review-5d5678a4c203
A brutal review of Wes McKinney's book on pandas.
What's an actually good intro to data analysis with Python?

Cingulate
Oct 23, 2012

by Fluffdaddy

QuarkJets posted:

If you have actual C-style functions dealing with regular old primitive types (float, int, etc) you can just compile a shared library and load it with ctypes.

I would not recommend using Cython... for anything, really. There are better alternatives no matter what you want to do
What's bad about Cython? Don't a bunch of scientific packages use it a lot?

Cingulate
Oct 23, 2012

by Fluffdaddy
If you expect to do more data analysis with Python in the future, the recommendation is probably to just learn pandas right now.

Cingulate
Oct 23, 2012

by Fluffdaddy

Seventh Arrow posted:

each line needs to take the value from the next row in the spreadsheet.
I don't understand what this means.

Cingulate
Oct 23, 2012

by Fluffdaddy

Seventh Arrow posted:

What I'm trying to say is that each folium line can't keep reading the first row over and over again. The first folium line needs to use row 1, the second one needs to use row 2, and so on.
Maybe you can describe what you need to happen, conceptually, but you can loop over the data frame. If you loop over a column (-> a pandas Series), it'll usually be equivalent to just looping over the content. E.g.,

code:
for value in df_raw['Latlng']:
    folium.do_something(value)
Although ideally, you'd vectorise that.

Sorry if I'm totally missing your point.

Cingulate
Oct 23, 2012

by Fluffdaddy

vikingstrike posted:

If you need to pull the value of the next row into the current row, then create a new column with shift(-1)?
Yeah maybe what you need is something like

code:
for first_val, second_val in zip(df_raw['Latlng'], df_raw['Latlng'].shift(-1)):
    folium.do_something(first_val, second_val)

Jose Cuervo posted:

If your dataframe has 'Lat', 'Long' and 'Description' columns, then I think this is what you might be looking for:

Python code:
for idx, row in df_raw.iterrows():
    folium.Marker([row['Lat'], row['Long']], popup=row['Description']).add_to(map_1)
or
Python code:
for lat, long, description in df[["Lat", "Long", "Description"]]:
    folium.Marker([lat, long], popup=description).add_to(map_1)

Cingulate
Oct 23, 2012

by Fluffdaddy

Seventh Arrow posted:

Sorry for the confusion. Maybe I can make it clearer: I need these lines "folium.Marker([x, y]..." populating the python script so they can put markers on the folium map. Except there's thousands of rows in the latitude/longitude csv, so I'm not going to write each folium line by hand.

So instead I need a way to get python to generate a bunch of "folium.Marker([x, y]..." lines, but also fill in the latitude/longitude information. Is that a bit better?

In the meantime, I'll take a look at your and Jose Cuervo's suggestions - thanks!


edit: of course, loading that much data into folium at once is another issue, but one thing at a time...
If the thing you need to do is indeed to go through the data line by line, and for each line, run the Marker thing on that line's values, then you could indeed do what I'm suggesting here:

Python code:
for lat, long, description in df[["Lat", "Long", "Description"]]:
    folium.Marker([lat, long], popup=description).add_to(map_1)
Also goes to Jose Cuervo.
(Can't vectorise if folium.Marker doesn't take array input.)

Seventh Arrow, what's throwing me off is you keep writing you want to "generate lines". But what you do want is to have Python go through the data and use the values, not literally create these lines of code, right?

Cingulate
Oct 23, 2012

by Fluffdaddy

Seventh Arrow posted:

Yes, I think so. Maybe a better way to put it is that I want folium to put a marker on the map for every lat/long coordinate in the csv. Whatever python voodoo it takes to do that is irrelevant to me (unless it actually involves sacrificing chickens on an altar).
Yeah then out of the solutions suggested so far, I think mine is the best.

I hope it's halfway intuitive what's going on - the
code:
df[["Lat", "Long", "Description"]]
part is extracting just these 3 columns of the data frame in just this order (if there are no other columns, it's pointless; then you just need to check the order), and the
code:
for lat, long, description in ...
part goes through the data line by line and calls the first column's value lat, the second's long, etc., and passes them on to the body of the loop, where you can run your Folium function.

Cingulate
Oct 23, 2012

by Fluffdaddy

Seventh Arrow posted:

Ok, thank you. What happened to the "shift(-1)"? Is that no longer necessary?
That came out of a misunderstanding of what you wanted. It assumed that you wanted, in each iteration of the loop, the nth and the n+1th item.

Cingulate
Oct 23, 2012

by Fluffdaddy

Jose Cuervo posted:

I don't know that the code you have works. Running the following (I think equivalent) code results in 'ValueError: need more than 1 value to unpack'
Python code:
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5], 'b': [2,3,4,5,6], 'c': [3,4,5,6,7]})

for i, j, k in df[['a', 'b', 'c']]:
	print i, j, k
Ah yes, sorry. Make it

Python code:
for lat, long, description in df[["Lat", "Long", "Description"]].values:
    folium.Marker([lat, long], popup=description).add_to(map_1)
Without the
code:
values
, you're just iterating over the column labels.
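To make the difference concrete (toy column names, made-up data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# iterating over the frame itself yields the column labels
print(list(df[['a', 'b', 'c']]))  # → ['a', 'b', 'c']

# .values yields one row (as an array) per iteration
for a, b, c in df[['a', 'b', 'c']].values:
    print(a, b, c)

# itertuples is an alternative that preserves per-column dtypes
for row in df.itertuples(index=False):
    print(row.a, row.b, row.c)
```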

Cingulate
Oct 23, 2012

by Fluffdaddy

QuarkJets posted:

There are lots of options but I've always found multiprocessing the easiest to use and most widely applicable. If you have no idea what to use and don't want to explain the problem further then maybe give that a shot
joblib is used in a bunch of scientific computing projects.

Cingulate
Oct 23, 2012

by Fluffdaddy

wasey posted:

How do I assign each line element to an object in Python
First of all, and apologies if this is a dumb question, but are you perfectly sure you actually do want that? Why?

Cingulate
Oct 23, 2012

by Fluffdaddy

wasey posted:

I'm the dumb one here, trying to learn Python on the fly for a class and it has been rough. I'm not sure that I want to do that, I just want to make sure that an activity's start and end time are not separated after I sort them by start time. I'm able to put the number of elements in a set in one list(11, 3) and the elements themselves into another list, but I don't know how to proceed from there
What do you want to do with the data? is it literally only sorting? Because if you want to do anything more with that, you'll probably want to use Pandas, or at least Numpy.

Once you have the data in either format, the sorting will be absolutely trivial (literally df.sort()), but it would be a bit more complicated to get the data in there due to the lone "3" in the 4th to last line.

So really, it depends a bit on what exactly you want to do.

Cingulate
Oct 23, 2012

by Fluffdaddy
In this case,
code:
print(*name)
will do the same though.
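For the record, what the star does here (name being a string is my assumption):

```python
name = "Guido"
print(*name)          # unpacks into print('G', 'u', 'i', 'd', 'o') → G u i d o
print(*name, sep="")  # sep controls the separator → Guido
```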

Cingulate
Oct 23, 2012

by Fluffdaddy

Slimchandi posted:

For clarity, I was referring to functions with side effects or return None rather than just callables. The given example was exactly what I intended.

It seems much more convenient than a for loop; is it just non-conventional or A Bad Thing?
I actually asked this very same question in I think the last thread. For the record, I've since come to understand the thread was perfectly correct to insist that I use a list comp to create a list and otherwise just do an ordinary loop. Much more pythonic!

Remember, it violates PEP8, but you can even do
[code]for counter in iterable: function_with_side_effects()[/code]
Without the line break.

Cingulate
Oct 23, 2012

by Fluffdaddy
I was a bit surprised
code:
print(name.split(""))
does not work.
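It raises a ValueError for the empty separator. If character splitting is what was wanted, list() does it:

```python
name = "abc"

# str.split refuses an empty separator...
try:
    name.split("")
except ValueError as e:
    print(e)  # ValueError: empty separator

# ...but list() splits a string into its characters
print(list(name))  # → ['a', 'b', 'c']
```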

Cingulate
Oct 23, 2012

by Fluffdaddy

Dr Subterfuge posted:

map sounds like what you're talking about?
baka is, in fact, literally talking about map:

baka kaba posted:

consume(map(print, name)).

Buuuut why not just
code:
for thing in things: function(thing)
?

Cingulate
Oct 23, 2012

by Fluffdaddy

sofokles posted:

Thats what im after
Why? Why did you want to do that?

Cingulate
Oct 23, 2012

by Fluffdaddy

Boris Galerkin posted:

I want to make multiple plots so that I can save and style them indivudually, and then I wanna be able to pass them into one master plot where I’d arrange them into some logical order, and then save that final one too. How would I do that?

Something like,

code:
# first figure
fig, ax1 = plt.subplots()
ax1.plot(...)
# style the plot
fig.savefig(...)

# repeat for ax2 and ax3

# this is the part I don’t understand
# define master_figure with a layout like this:
# ax1 ax1 ax1
# ax2 ax2 ax3
# ax2 ax2 ax3

master_figure.add_ax1_to_proper_layout_location()
# repeat for others

# each subplot would keep the sane colors, styles, limits as already defined
master_figure.savefig(...)
Just to be clear, each ax object should be saved as a full figure that takes up the whole page, and then when I pass it into master_figure it should just automatically resize and put it onto the proper grid location. Everything else wrt limits, markers, etc, should not change.
code:
fig, axes = plt.subplots(n_columns, n_rows)

axes[0, 1].plot(something)
axes[1, 3].scatter(*data)
fig.savefig(name)
Like this?

Cingulate
Oct 23, 2012

by Fluffdaddy
You can try this stackexchange answer:
https://stackoverflow.com/questions/6309472/matplotlib-can-i-create-axessubplot-objects-then-add-them-to-a-figure-instance/46906599#46906599

But my suggestion would be to write a function that creates your axis, and give it an axis parameter. Then you call it once for its own figure, and another time for the joint figure. Much less awkward.
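Roughly what I mean (plot_thing and the data are invented; the real styling would go inside the function):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, just for this example
import matplotlib.pyplot as plt

def plot_thing(data, ax=None):
    # draw on the given axes, or make a standalone figure if none given
    if ax is None:
        _, ax = plt.subplots()
    ax.plot(data)
    return ax

# once as its own figure ...
plot_thing([1, 2, 3])

# ... and once more into one panel of a joint figure
fig, axes = plt.subplots(1, 2)
plot_thing([1, 2, 3], ax=axes[0])
plot_thing([3, 2, 1], ax=axes[1])
```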

Cingulate
Oct 23, 2012

by Fluffdaddy
Generally, if you want to do something in Python and it’s awkward, more often than not you want to do the wrong thing. Just try it with a function, like I said.

Everyone and their moms uses notebooks. When I want to multiply two numbers and one of them has more than 3 digits, I open a notebook.
That is, if you analyse data.

Cingulate
Oct 23, 2012

by Fluffdaddy
The basic math thing was hyperbole.

Well, what is it that you need to do? Complex plotting things, with multiple open figures, are a great scenario for notebooks. If you frequently go back to the data itself, even better.

In other contexts, other tools will be superior. It depends.

Cingulate
Oct 23, 2012

by Fluffdaddy
Can you show what the file looks like? I am pretty optimistic this can be solved with 3 lines of pandas.

Cingulate
Oct 23, 2012

by Fluffdaddy
I'm trying to look up what a movie professional's primary occupation is.

Python code:
searchstr = "nconst == @person"
find = lambda person: df_movies.query(searchstr)["category"].mode()[0]
df_actors["main"] = df_actors["nconst"].map(find)
("nconst" is an identifier, "category" is the job on that movie)

It's not too slow, but could it go faster?

Edit: obvious suggestion would be kicking out all the porn movies.

edit 2: Oh wow, I switched to df.groupby and now it's much faster.
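For anyone curious, something like this (column names as above, data made up), one pass over the data instead of one query per person:

```python
import pandas as pd

df_movies = pd.DataFrame({
    'nconst': ['p1', 'p1', 'p1', 'p2'],     # person identifier
    'category': ['actor', 'actor', 'director', 'writer'],  # job on that movie
})

# most common category per person, computed once for everyone
main_job = (df_movies.groupby('nconst')['category']
            .agg(lambda s: s.mode()[0]))
print(main_job['p1'])  # → actor
```

The result can then be mapped onto the actors table, e.g. `df_actors['main'] = df_actors['nconst'].map(main_job)`.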

Cingulate fucked around with this message at 20:37 on Feb 28, 2018

Cingulate
Oct 23, 2012

by Fluffdaddy

Sad Panda posted:

Next part of my Blackjack program. A lookup table. A short extract of the data would be...


code:
        2   3   4   5   6   7   8   9   T   A
Hard 5  H   H   H   H   H   H   H   H   H   H
Hard 6  H   H   H   H   H   H   H   H   H   H
Hard 7  H   H   H   H   H   H   H   H   H   H
Hard 8  H   H   H   H   H   H   H   H   H   H
Hard 9  H   D   D   D   D   H   H   H   H   H
Hard 10 D   D   D   D   D   D   D   D   H   H
I want to be able to input 2, Hard 6 and it return H.

My original idea was 2D arrays, but that doesn't seem to support a column name which is what I'd call that 2/3/4/.. at the top. I found one solution, and he used a Pickled 'av table' (so the variable name suggests), but that seems a bit beyond me right now.
In pandas, that would be df.loc["Hard 6", 2]
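Building a data frame from your extract by hand to show the lookup (in your case you'd probably read the table in from a file instead; note the column labels here are strings, from a csv they might come in as ints):

```python
import pandas as pd

cols = ['2', '3', '4', '5', '6', '7', '8', '9', 'T', 'A']
df = pd.DataFrame(
    [list('HHHHHHHHHH'),   # Hard 5
     list('HHHHHHHHHH'),   # Hard 6
     list('HDDDDHHHHH'),   # Hard 9
     list('DDDDDDDDHH')],  # Hard 10
    index=['Hard 5', 'Hard 6', 'Hard 9', 'Hard 10'],
    columns=cols,
)

print(df.loc['Hard 6', '2'])   # → H
print(df.loc['Hard 10', '3'])  # → D
```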

Cingulate
Oct 23, 2012

by Fluffdaddy
I think that function is overkill. Either that, or it's a bit inefficient to always read in the csv whenever you want to retrieve one single number.

You could also write an input check that gracefully fails whenever you request a combination that doesn't exist.

In the long run, you probably want to construct a class (to handle state), and that class could store the df, and that function could be a method.
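A minimal sketch of what I mean (names invented, table truncated):

```python
import pandas as pd

class Strategy:
    """Holds the lookup table once; lookups are then cheap method calls."""

    def __init__(self, df):
        self.df = df

    def action(self, hand, upcard):
        # fail gracefully for combinations that don't exist
        try:
            return self.df.loc[hand, upcard]
        except KeyError:
            raise ValueError("no rule for %r vs %r" % (hand, upcard))

table = pd.DataFrame({'2': ['H', 'D'], '3': ['H', 'D']},
                     index=['Hard 5', 'Hard 10'])
strategy = Strategy(table)
print(strategy.action('Hard 5', '2'))  # → H
```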

Cingulate
Oct 23, 2012

by Fluffdaddy

Sad Panda posted:

You're right about the reading of the CSV. I've moved that outside of the function so it should just happen at load when the df object is created and then stay loaded in cache right?
Yes, if you define the object in the global namespace before the function is called, it will be available inside the function. You could also pass the data frame to the function as an argument, perhaps as a default.

Though as I said, you will probably end up using a class or two.

Cingulate
Oct 23, 2012

by Fluffdaddy
You mean like this?

code:
df1 = pd.DataFrame({'A': ['a', 'b', 'c', 'd'],
                    'B': ['.']*4},
                    index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['a', 'c', 'd'],
                    'B': ['one', 'two', 'three']},
                    index=[0, 1, 2])
df1b = df1.set_index("A")
df1b["B"] = df2.set_index("A")["B"]

print(df1b)
       B
A       
a    one
b    NaN
c    two
d  three
(The assignment aligns on index.)

Cingulate
Oct 23, 2012

by Fluffdaddy
... there is a non-dark theme in pycharm?

Cingulate
Oct 23, 2012

by Fluffdaddy

Dr Subterfuge posted:

Instead of X it would probably be easier to use pandas
Thread title

Cingulate
Oct 23, 2012

by Fluffdaddy
edit: I am incredibly stupid

Cingulate fucked around with this message at 07:00 on Apr 12, 2018

Cingulate
Oct 23, 2012

by Fluffdaddy
I'm setting up a static website, hopefully one that's community-editable. So I'm thinking markdown and GitHub. I only really know Python, so I went for Pelican, but now I'm wondering if I shouldn't just suck it up and use Jekyll. Thoughts from the more experienced people ITT?

Cingulate
Oct 23, 2012

by Fluffdaddy

Thermopyle posted:

Are you sure you want something like that instead of a wiki? A wiki is the first thing I thought of when you said "community editable".
For various reasons, no. It's in part representation.

Also I want to force GitHub and markdown down their throats anyways :colbert:


Cingulate
Oct 23, 2012

by Fluffdaddy

bamhand posted:

I think it's a dataframe? I'm modifying some old code where I'm just changing the source of the data but trying to keep the plots the same. This is what was working before (also I'm pretty new to python in general):
code:
a1 = self.est_data_obj.tsa_data[self.est_data_obj.endog][self.est_data_obj.bgn_obs : self.est_data_obj.end_obs]
a2 = self.reg_results.fittedvalues[self.est_data_obj.bgn_obs : self.est_data_obj.end_obs]
a1.plot(
    title='In-sample Fit Chart',
    linewidth=1,
    linestyle='-',
    marker='d',
    markersize=5,
    fillstyle='none',
    color='cornflowerblue',
    label='Original Dependent Values')

a2.plot(
    kind='line',
    linewidth=1,
    marker='o',
    markersize=5,
    fillstyle='none',
    linestyle='-',
    color='red',
    label='Predicted Dependent Values')

plt.xlabel('Date')
plt.ylabel(self.var_dic[self.est_data_obj.endog], size=10)
plt.legend()
plt.show()
This is the new code I added for changing the data, where sd2df creates a dataframe from saspy:

a = self.sas.sd2df("reg_out", "work").loc[:,["date","heloc2","pred"]]
a=a.loc[a["heloc2"].notnull(),]
pd.DataFrame's plot method takes an ax argument. I.e., do

ax = plt.axes()
df_1.plot(..., ax=ax)
df_2.plot(..., ax=ax)

and both calls will show up in the same axes.

Better: use seaborn instead.
If you want to plot predicted vs. actual, consider using seaborn.jointplot.
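A runnable sketch of the ax-sharing idea (df_1/df_2 stand in for your a1/a2):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, just for this example
import pandas as pd
import matplotlib.pyplot as plt

df_1 = pd.Series([1, 2, 3], name='Original Dependent Values')
df_2 = pd.Series([1.1, 2.2, 2.9], name='Predicted Dependent Values')

# both plot calls draw into the same axes object
ax = plt.axes()
df_1.plot(ax=ax, marker='d', label='Original Dependent Values')
df_2.plot(ax=ax, marker='o', label='Predicted Dependent Values')
ax.legend()
print(len(ax.lines))  # → 2
```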
