Jose Cuervo
Aug 25, 2004
I am trying to understand OrderedDicts. I am reading the documentation and am having trouble understanding what the following line is saying:

"The OrderedDict constructor and update() method both accept keyword arguments, but their order is lost because Python’s function call semantics pass-in keyword arguments using a regular unordered dictionary."

Is this saying that if I add items to the OrderedDict as follows
Python code:
from collections import OrderedDict

q = OrderedDict()
q[1] = 'first'
q[2] = 'second'
q[3] = 'third'

for i, j in q.iteritems():
	print i, j
then there is no guarantee that what will be printed is
code:
1 first
2 second
3 third
??
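
(For context, the quoted docs line is about keyword-argument construction, not item-by-item assignment - a sketch of the distinction, assuming a pre-3.7 interpreter where plain dicts do not preserve order:)
Python code:
from collections import OrderedDict

# Item-by-item assignment: insertion order is preserved, so the loop in the
# post above is guaranteed to print 1 first, 2 second, 3 third.
q = OrderedDict()
q[1] = 'first'
q[2] = 'second'
q[3] = 'third'

# Keyword-argument construction: the kwargs arrive as a regular dict,
# so on older interpreters this ordering is not guaranteed.
r = OrderedDict(first=1, second=2, third=3)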


Jose Cuervo
Aug 25, 2004
Got it, thanks!

Jose Cuervo
Aug 25, 2004

The Gunslinger posted:

I just uninstalled Python and installed Anaconda instead. Having the same problem though.

code:
C:\Users\TGS\Desktop\Python>python bob.py
Traceback (most recent call last):
  File "bob.py", line 8, in <module>
    import turtle
  File "C:\Users\TGS\Desktop\Python\turtle.py", line 2, in <module>
    bob = turtle.Turtle()
AttributeError: module 'turtle' has no attribute 'Turtle'
Anaconda is installed and has set itself in my Path. I just want to be able to work from any directory and import libraries from the standard libs. Do I need to specify the folder when I do this? I don't really understand why it's looking for turtle.py inside of that directory instead of in C:\users\TGS\Anaconda3. After setting the PYTHONPATH it still does the same thing.

I can only import sys and print sys.path if I run python from a shell in C:\users\TGS. It shows the default Anaconda install directory, scripts sub folders and etc.

I just want to be able to run scripts from any directory instead of being locked to my user directory.

I think you have a file named turtle.py in the same folder as bob.py. So when Python looks for the turtle module, it "finds" it where it looks first - the same folder - and never gets to the folder where Python is installed.
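
(A quick way to confirm which file is actually being imported - a diagnostic sketch, not from the thread:)
Python code:
import turtle

# If this prints a path inside C:\Users\TGS\Desktop\Python rather than the
# standard library, a local turtle.py is shadowing the real module.
print(turtle.__file__)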

Jose Cuervo
Aug 25, 2004
I am using Python 2.7 with PyCharm Community Edition 2016.3.2. I have the following snippet of code:

Python code:
class ParentNode(node.Node):
    """ParentNode is a subclass of Node but also takes an additional child_nodes parameter.
    This class also contains queues (maintenance_queue, available_queue, en_route_queue)
    formed from dictionaries, and a list of TMRs that have occurred.
    @type child_nodes: dict[str: child_node.ChildNode]
    """
    def __init__(self, name, node_type, FSPs, CTT_distributions, TMR_distributions,
                 prob_on_convoy, rep_rndstrm, child_nodes):
        """ParentNode is a subclass of Node but also takes an additional child_nodes parameter.
        This class also contains queues (maintenance_queue, available_queue, en_route_queue)
        formed from dictionaries, and a list of TMRs that have occurred.
        @type child_nodes: dict[str: child_node.ChildNode]
        """
        node.Node.__init__(self, name, node_type, FSPs, CTT_distributions, TMR_distributions,
                           prob_on_convoy, rep_rndstrm)
        self.network_status = 'parent'
        self.child_nodes = child_nodes
The issue is that when I hover over self.child_nodes or child_nodes, the inferred type is shown as 'Any' instead of 'Dict[str, ChildNode]'. I don't understand why the typehinting I have in the docstring does not work in this case.

Jose Cuervo
Aug 25, 2004

QuarkJets posted:

Try using a comma instead of a colon in the typehinting

Forgot to say that this is what the problem was. Thanks!
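
(That is, the docstring annotation becomes a comma-separated mapping type:)
Python code:
"""ParentNode is a subclass of Node but also takes an additional child_nodes parameter.

@type child_nodes: dict[str, child_node.ChildNode]
"""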

Jose Cuervo
Aug 25, 2004

Eela6 posted:

Why not? I'd love to know anything more about flask

Me too.

Jose Cuervo
Aug 25, 2004

FingersMaloy posted:

I'm still trying to make this scraper work. I've abandoned BeautifulSoup and totally committed to Scrapy, my spider works but I can't make it pull the exact pieces I need. I'm using this code as my guide, but it's not fully working:

https://github.com/jayfeng1/Craigslist-Pricing-Project/blob/master/craigslist/spiders/CraigSpyder.py

He explains the methodology here:

http://www.racketracer.com/2015/01/29/practical-scraping-using-scrapy/

Python code:
# -*- coding: utf-8 -*-
import scrapy
from craig.items import CraigItem

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["https://cleveland.craigslist.org/"]
    start_urls = ['https://cleveland.craigslist.org/search/hhh?query=no+section+8&postedToday=1&availabilityMode=0']

    def parse(self, response):
        titles = response.xpath("//p")
        for titles in titles:
            item = CraigItem()
            item['title'] = titles.xpath("a/text()").extract()
            item['link'] = titles.xpath("a/@href").extract()
            items.append(item)
            follow = "https://cleveland.craigslist.org" + item['link']
            request = scrapy.Request(follow , callback=self.parse_item_page)
            request.meta = item
            yield request
        
    def parse_item_page(self, response):
        maplocation = response.xpath("//div[contains(@id,'map')]")
        latitude = maplocation.xpath('@data-latitude').extract()
        longitude = maplocation.xpath('@data-longitude').extract()
        if latitude:
            item['latitude'] = float(latitude)
        if longitude:
            item['longitude'] = float(longitude)
        return item
On line 18 (follow =), I get: TypeError cannot concatenate 'str' and 'list' objects.

I run this command to execute the program: scrapy crawl basic -o items.csv -t csv. If I remove the second method I can get a spreadsheet with titles and links, but I need the geotag.

Any ideas?

The error message is telling you that you are trying to concatenate a string and a list - item['link'] is a list, because extract() returns a list of matches. I also think there is a problem with your for statement ("for titles in titles:"): you probably want "for title in titles:", and then use title instead of titles inside the loop so you are working with one result at a time.
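
(A hedged sketch of one way to fix the concatenation itself, reusing the names from the quoted spider - extract_first() returns a single string, or None if nothing matched:)
Python code:
# Fragment only: these lines slot into the body of parse() in the quoted code
item['title'] = title.xpath("a/text()").extract_first()
item['link'] = title.xpath("a/@href").extract_first()
if item['link']:
    # item['link'] is now a plain string, so the concatenation works
    follow = "https://cleveland.craigslist.org" + item['link']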

Jose Cuervo
Aug 25, 2004
I have read a large amount of data into a Pandas DataFrame from multiple .csv files (where each .csv file contained the same data but for different years). There are columns corresponding to test scores in the DataFrame which should all be numbers, however based on a df[col_name].unique() call it is clear that some files stored the numbers in the columns as numbers, some as strings, and sometimes when there wasn't a test score text was used ('No Score').

Because of this, I cannot use the .astype() functionality to convert the column to floats. What would be a good way to go about identifying the entries that are stored as strings and converting them to floats while at the same time identifying the entries that are text and replacing those entries with nan?
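
(One common approach, sketched under the assumption that the DataFrame and column are named df and col_name: pd.to_numeric with errors='coerce' parses numeric strings to floats and turns text like 'No Score' into NaN in a single pass.)
Python code:
import pandas as pd

# Numeric strings become floats; anything unparseable (e.g. 'No Score') becomes NaN
df[col_name] = pd.to_numeric(df[col_name], errors='coerce')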

Jose Cuervo
Aug 25, 2004

vikingstrike posted:

code:

import pandas as pd

frame = pd.read_csv('my_data.csv')

frame = frame.assign(More_Than_One=(frame.party_size > 1))
frame = frame.assign(Total_Cost=frame.party_size * frame.ticket_price)


Or even simpler:

code:

import pandas as pd

df= pd.read_csv('my_data.csv')

df['More_Than_One'] = df['party_size'] > 1
df['Total_Cost'] = df['party_size'] * df['ticket_price']

Jose Cuervo
Aug 25, 2004
I wrote a small data processing script a while back and since then I have updated pandas. Now the script does not work. Short of creating a venv for a single script, is there any way of telling my system to use a specific version of pandas when running the script?

Jose Cuervo
Aug 25, 2004
My use case is that I need to rerun a bit of analysis for a paper in response to a review that requires something slightly different or with a updated set of data. I don't think it is worth my time to fix the script that I might have written 6 months ago for a one time run, so it seems like the venv is the way to go.

Jose Cuervo
Aug 25, 2004

Seventh Arrow posted:

What I'm trying to say is that each folium line can't keep reading the first row over and over again. The first folium line needs to use row 1, the second one needs to use row 2, and so on.

If your dataframe has 'Lat', 'Long' and 'Description' columns, then I think this is what you might be looking for:

Python code:
for idx, row in df_raw.iterrows():
    folium.Marker([row['Lat'], row['Long']], popup=row['Description']).add_to(map_1)

Jose Cuervo
Aug 25, 2004

Fair enough - but how would you vectorize what Seventh Arrow wants to do?

Edit: Or are you saying use something like:
Python code:
for lat, long, description in zip(df['Lat'], df['Long'], df['Description']):
    folium.Marker([lat, long], popup=description).add_to(map_1)

Jose Cuervo fucked around with this message at 17:45 on Jan 10, 2018

Jose Cuervo
Aug 25, 2004

Cingulate posted:

If the thing you need to do is indeed to go through the data line by line, and for each line, run the Marker thing on that line's values, then you could indeed do what I'm suggesting here:

Python code:
for lat, long, description in df[["Lat", "Long", "Description"]]:
    folium.Marker([lat, long], popup=description).add_to(map_1)
Also goes to Jose Cuervo.
(Can't vectorise if folium.Marker doesn't take array input.)

Seventh Arrow, what's throwing me off is you keep writing you want to "generate lines". But what you do want is to have Python go through the data and use the values, not literally create these lines of code, right?

I don't know that the code you have works. Running the following (I think equivalent) code results in 'ValueError: need more than 1 value to unpack'
Python code:
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5], 'b': [2,3,4,5,6], 'c': [3,4,5,6,7]})

for i, j, k in df[['a', 'b', 'c']]:
	print i, j, k
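
(As an aside, iterating a DataFrame directly yields the column names, which is why the unpacking fails; something like itertuples gives the row values instead - a sketch:)
Python code:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [3, 4, 5]})

# itertuples(index=False) yields one tuple of values per row
for a, b, c in df[['a', 'b', 'c']].itertuples(index=False):
    print(a, b, c)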

Jose Cuervo
Aug 25, 2004

Seventh Arrow posted:

Ok I think I'm almost there. I've managed to twist its arm enough that it will post the map marker with a colour...once. So I think I'm in the ballpark, but I get a map with only one marker. Here's what I have so far:

code:
import pandas as pd
import folium

df = pd.read_excel('df_condo_v9_t1.xlsx', sheetname=0, header=0)

df.shape

df.dtypes

map_center = [df["Latitude"].mean(), df["Longitude"].mean()]

map_1 = folium.Map(location=map_center, tiles="Stamen Terrain", zoom_start=12)

class color:
    def out(self):
        print("successful.")

for i, row in df[["Latitude", "Longitude", "Price"]].dropna().iterrows():
    position = (row["Latitude"], row["Longitude"])
    pr = (row["Price"])
    cl = color
    if pr < 999:
        cl = 'green'
    elif pr < 2000:
        cl = 'yellow'
    elif pr < 3000:
        cl = 'orange'
    else:
        cl = 'red'
folium.Marker(position, icon=folium.Icon(color=cl)).add_to(map_1)

map_1
I think this could be due to one of two things:

A) The "dropna" and "iterrows" are working their magic with the "position" variable so that when the process gets around to "pr", almost all of the rows have been dropped. If this is true, then I don't know enough about python to properly address it.
B) It could be something folium-specific, since jupyter displays the following: <folium.map.Marker at 0x7efe9fd758d0>

Also, the "class color" thing is something I got off of Stack Overflow. If I don't put it in there, it says that "color is not defined."



You need to place the line

folium.Marker(position, icon=folium.Icon(color=cl)).add_to(map_1)

inside the for loop (so basically tab it over once).

Right now the only thing being added to the map is the last location.

EDIT: Also it looks like you chose to use the iterrows() solution that I proposed, but as pointed out that is probably the slowest way to do things, and you should probably use one of the other suggested methods.
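
(Roughly, with the names from the quoted code, the end of the loop would look like this - only the indentation of the Marker line changes:)
Python code:
for i, row in df[["Latitude", "Longitude", "Price"]].dropna().iterrows():
    position = (row["Latitude"], row["Longitude"])
    pr = row["Price"]
    if pr < 999:
        cl = 'green'
    elif pr < 2000:
        cl = 'yellow'
    elif pr < 3000:
        cl = 'orange'
    else:
        cl = 'red'
    # Inside the loop now, so a marker is added for every row
    folium.Marker(position, icon=folium.Icon(color=cl)).add_to(map_1)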

Jose Cuervo fucked around with this message at 02:20 on Jan 24, 2018

Jose Cuervo
Aug 25, 2004

Wallet posted:

I have another potentially stupid question that I can hopefully explain in something resembling a comprehensible fashion:

Let's say I have a paragraph of text with hard line breaks in a list, with each line as an item:
code:
['There was no hope for him this time: it was the third stroke. Night',
'after night I had passed the house (it was vacation time) and studied',
'the lighted square of window: and night after night I had found it lighted',
'in the same way, faintly and evenly. If he was dead, I thought, I would see',
'the reflection of candles on the darkened blind for I knew that two candles',
'must be set at the head of a corpse.']
And I generate a separate list that is storing that text as sentences:
code:
['There was no hope for him this time: it was the third stroke.',
'Night after night I had passed the house (it was vacation time) and studied the lighted square of window: and night after night I had found it lighted in the same way, faintly and evenly.',
'If he was dead, I thought, I would see the reflection of candles on the darkened blind for I knew that two candles must be set at the head of a corpse.']
And I want to do some stuff with the sentences. Like, for an arbitrary example, let's say that I want to count the number of letters in each word in the sentences. To be clear, the actual stuff I'm doing requires that the text be broken out into sentences. Let's say I'm storing the counts in nested lists:
code:
[[5, 3, 2, 4, 3, 3, 4, 4, 2, 3, 3, 5, 6], [5, 5, 5, 1, 3, ...], [...]]
Now, let's say that I want to insert those letter counts back into the original version that was broken up by lines, which might look something like this (pretend I did the whole thing):
code:
['There{5} was{3} no{2} hope{4} for{3} him{3} this{4} time{4}: it{2} was{3} the{3} third{5} stroke{6}. Night{5},',
'after{5} night{5}...']
Is there some clever way to accomplish this? The best I can come up with is using word count/position to sort of reconstruct where the numbers should be inserted, since the third word in the sentence version is still the third word in the line version, but I dunno if I'm missing something.

You could generate a single list of all the numbers, then iterate through the original lines, and for each word on each of the original lines, pop a number from the single list to attach to the end of the word.

Edit: Something along these lines:
Python code:
new_text = ''
for line in all_lines:
    # line is a string, so split() gives the words rather than individual characters
    for word in line.split():
        new_text += word + '{' + str(all_numbers.pop(0)) + '} '
    new_text += '\n'
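
(Building that single list of numbers from the nested per-sentence counts could look like this - letter_counts is a hypothetical name for the nested list shown above:)
Python code:
# Flatten [[5, 3, 2, ...], [5, 5, 5, ...], ...] into one list, in sentence order
all_numbers = [n for sentence_counts in letter_counts for n in sentence_counts]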

Jose Cuervo fucked around with this message at 15:01 on Jan 26, 2018

Jose Cuervo
Aug 25, 2004
I am trying to understand what I am doing wrong with a Pandas merge of two data frames:

Python code:
merged_df = df_one.merge(df_two, how='left', on=['ID', 'Date'])
My understanding of this merge is that merged_df would have exactly the same number of rows as df_one because I am merging only on keys that exist in df_one (i.e., each ('ID', 'Date') tuple from df_one is iterated over and if that tuple exists as a row in df_two, then that data is copied over, and ('ID', 'Date') tuples in df_two that don't exist in df_one are ignored, correct?). However, when I run this code with my data, merged_df ends up having more rows than df_one has, and I don't know what I am doing wrong.

Jose Cuervo
Aug 25, 2004

vikingstrike posted:

Is df_two unique on ID, Date?

Are you asking if all the (ID, Date) tuple combinations in df_two are unique? If so, yes. df_two was generated using groupby where on=['ID', 'Date'].

Jose Cuervo
Aug 25, 2004

vikingstrike posted:

Er, sorry. Unique was a bad word. In df_two does (ID, Date) ever include multiple rows?

No, because df_two was generated using a groupby statement where on=['ID', 'Date'] - so every row in df_two corresponds to a unique ('ID', 'Date') tuple.

Jose Cuervo
Aug 25, 2004

vikingstrike posted:

Try the indicator flag on the merge and then use it to see if it might lead you to where the extra rows are coming from.

Just wanted to say you got me thinking about having duplicates, and df_one actually had non-unique ('ID', 'Date') tuples, so that is where the problem was. Thanks for the help.
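
(For anyone hitting the same thing, a quick way to check for duplicate keys, plus the indicator flag mentioned above - a sketch assuming the frame names from the post:)
Python code:
# True if any ('ID', 'Date') pair appears more than once
print(df_one.duplicated(subset=['ID', 'Date']).any())
print(df_two.duplicated(subset=['ID', 'Date']).any())

# indicator=True adds a '_merge' column showing where each output row came from
merged_df = df_one.merge(df_two, how='left', on=['ID', 'Date'], indicator=True)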

Jose Cuervo
Aug 25, 2004
Here is my folder structure:

code:
/scripts
	input_generation.py
/output
	/run_1
		/code
			input_generation.py
test_script.py
In test_script.py, when I write
Python code:
import input_generation as ig
I would like to import output/run_1/code/input_generation.py and not scripts/input_generation.py. Is there a way of achieving this?

Jose Cuervo
Aug 25, 2004

Dr Subterfuge posted:

You'll want to put an __init__.py everywhere so you can import your folders as packages

which will allow you to do this

Python code:
import output.run_1.code.input_generation as ig

Thanks.

Am I correct in thinking that with the following structure and __init__.py placement:

code:
/scripts
	input_generation.py
/output
	__init__.py
	/run_1
		__init__.py
		/code
			__init__.py
			input_generation.py
test_script.py
the statement
Python code:
import input_generation as ig
would use scripts/input_generation.py, while
Python code:
import output.run_1.code.input_generation as ig
would use output/run_1/code/input_generation.py?

But with the following structure and __init__.py placement:

code:
/scripts
	__init__.py
	input_generation.py
/output
	__init__.py
	/run_1
		__init__.py
		/code
			__init__.py
			input_generation.py
test_script.py
the statement
Python code:
import input_generation as ig
would throw an error?

Jose Cuervo
Aug 25, 2004

Boris Galerkin posted:

Helpful explanation.

Got it. Dr Subterfuge's suggestion of adding the scripts folder to the path does make what I posted work, and seems much simpler than having to rename the scripts folder.
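
(For completeness, a sketch of what adding the scripts folder to the path at the top of test_script.py might look like - the exact path handling here is an assumption, not quoted from the thread:)
Python code:
import os
import sys

# Make scripts/ importable regardless of where test_script.py is run from
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), 'scripts'))

import input_generation as ig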

Jose Cuervo
Aug 25, 2004
I have a Jupyter notebook where I have a set of 3 subplots which share an x-axis. The x-axis is time in weeks, and I would like to make the plot scrollable (ideally) with two buttons at the bottom - a forward button which increases the upper and lower bounds on the x-axis by 4 weeks, and a backward button which decreases the upper and lower bounds on the x-axis by 4 weeks.



This is the code I use right now to generate a static image in the notebook cell:
Python code:
def plot_history(df, months_back):
    # df is passed in explicitly; rebinding df below would otherwise make it a
    # local name and raise UnboundLocalError on the next line
    df.index = df['Date']
    df = df.sort_values(by='Date')
    most_recent_date = df.iloc[df.shape[0] - 1]['Date']
    if most_recent_date.day == 1:
        start_date = most_recent_date - pd.tseries.offsets.MonthBegin(months_back)
    else:
        start_date = most_recent_date - pd.tseries.offsets.MonthBegin(months_back + 1)
    if start_date.dayofweek != 0:
        start_date -= pd.tseries.offsets.Week(weekday=0)
    df = df[start_date:most_recent_date]
        
    # Plots
    height_ratios = [1.25, 3, 3]
    fig, subplot = plt.subplots(nrows=len(height_ratios), ncols=1, 
                                figsize=(20, 15), sharex='all',
                                gridspec_kw={'height_ratios': height_ratios})

    plot_subplot_0(subplot[0], df)
    plot_subplot_1(subplot[1], df)
    plot_subplot_2(subplot[2], df)

    set_x_axis_attributes(start_date, most_recent_date)

    plt.tight_layout()
    plt.show()
    plt.close()


def set_x_axis_attributes(start_date, most_recent_date):
    """
    Sets
    1. The limits on the x-axis,
    2. Formats the ticks so that they occur every Sunday, and
    3. Formats the tick labels to only show the year when relevant.
    """
    plt.xlim(start_date - pd.tseries.offsets.Day(1), most_recent_date + pd.tseries.offsets.Week(weekday=0))
    locs, labels = [], []
    date = start_date
    year = start_date.year
    while date < most_recent_date + pd.tseries.offsets.Week(weekday=6):
        locs.append(date)
        labels.append(date.strftime('%b-%d'))
        date += pd.tseries.offsets.Week(1)
        if date.year > year:
            labels[-1] += ('\n%i' % year)
        year = date.year

    # Show the year for the last tick label
    date -= pd.tseries.offsets.Week(1)
    labels[-1] += date.strftime('\n%Y')
    
    plt.xticks(locs, labels, fontsize=15)
I think I can use ipywidgets to accomplish this, but I am not sure how to structure the code to achieve what I want. Any pointers appreciated.
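
(A rough sketch of the ipywidgets wiring, with heavy assumptions: plot_history_window is a hypothetical variant of plot_history that takes explicit start/end dates, and start_date/end_date are the initial bounds - none of these names come from the thread.)
Python code:
import ipywidgets as widgets
import pandas as pd
from IPython.display import display

state = {'offset_weeks': 0}
out = widgets.Output()

def redraw():
    with out:
        out.clear_output(wait=True)
        shift = pd.tseries.offsets.Week(state['offset_weeks'])
        # Hypothetical helper: like plot_history, but takes explicit x-axis bounds
        plot_history_window(start_date + shift, end_date + shift)

def go_forward(_):
    state['offset_weeks'] += 4
    redraw()

def go_backward(_):
    state['offset_weeks'] -= 4
    redraw()

forward_button = widgets.Button(description='Forward 4 weeks')
backward_button = widgets.Button(description='Back 4 weeks')
forward_button.on_click(go_forward)
backward_button.on_click(go_backward)

display(out, widgets.HBox([backward_button, forward_button]))
redraw()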

Jose Cuervo
Aug 25, 2004
I taught a class for a colleague last week that used Jupyter notebooks. Each student was told to download the notebook from the class website, and then we went through the notebook in class and the students had cells where they had to type their own code.

One of the issues I encountered when plotting using seaborn was that not everyone had the exact same plot - i.e., the scatter plot matrix looked slightly different between students (the underlying shape etc was correct, but the presentation was different). I think this came down to the fact that not everyone was using Python 3 like I was, and not everyone had the same version of seaborn installed.

Another issue was that pandas.cut() worked slightly differently for everyone because of changes between versions.

Question: Is there a standard / simple way in the Notebook to ensure that everyone uses the same version of Python, and the same version of the packages being imported?
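
(Not a full answer, but a sanity-check cell at the top of the notebook at least surfaces version differences up front - a sketch:)
Python code:
import sys

import pandas as pd
import seaborn as sns

# Doesn't pin anything, but makes "your plot looks different" conversations shorter
print(sys.version)
print('pandas', pd.__version__)
print('seaborn', sns.__version__)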

Jose Cuervo
Aug 25, 2004

Symbolic Butt posted:

Installing Anaconda seems to be the best solution for this in my experience.

What does this process look like? Get everyone in the class to download Anaconda on day one of the class, then...?

OnceIWasAnOstrich posted:

Run a JupyterHub server and set everyone to use one consistent environment?

The JupyterHub documentation looks like it is tailor-made for this, thanks.

Jose Cuervo
Aug 25, 2004
Not sure if this is the right place to ask this, but I want to generate a time series which simulates consumption of a product. One way of doing this is to assume that the consumption has a particular distribution, say Triangular(lower=2, mode=4, upper=5), and then at each time step draw a random variate from that distribution to simulate the amount consumed during that time step. However, generating the time series in this way does not produce any correlation in consumption between successive time steps. That is, if the consumption of the product was on the higher end of the distribution at time t, then the consumption of the product at time t+1 should likely be on the higher end of the distribution as well, and vice versa.

How would I go about adding the correlation aspect to the simulated time series?

Jose Cuervo
Aug 25, 2004

CarForumPoster posted:

If you’re generating based on random variables (0,1) that follow a triangular CDF, you could make an if/then that looks at the random variable from t-1 and keeps generating numbers until the random number is within x distance/percentage/etc of the previous


Morek posted:

What is the usage scale you're actually trying to simulate? Hundreds of use sessions, millions, billions?
How often do you need to run this simulation?
Getting a pin on that will help tell you what kind of abstractions/simplifications of the problem will work for your purposes.

If the answer was "less than a few million" and "once", i'd probably take a swing at directly simulating users and the properties of their sessions.

Some data structure of active users and their session states, at simulation clock tick you walk your set and simulate the user doing things or ending their session and new users signing in.
Nice thing about this sort of approach is it gives you easy access to a bunch of knobs that make logical sense to the business ("How long do users stay logged in?") and also easily tune down fidelity/tune up scale as needed.


Spime Wrangler posted:

You don't really give enough info to specify the problem. Does the triangle distribution describe instantaneous total consumption at time t, or does it describe inter-arrival times of consumers, or per-instance time-required-to-consume?

If it's the first then maybe describe the usage you want to simulate as the combination of a deterministic function and a random function, e.g.: consumption(t) = sin(t) + gaussian(mu=0, sigma=1)

Or maybe some kind of random walk?

Sorry for not providing enough information. The project involves simulating the amount of fuel consumed at a forward operating base each day. During periods of intense activity there are several or more days of high daily consumption, while during lull periods there are stretches of several days of low daily consumption. Historical consumption information is classified, so I have to generate a consumption time series on my own. The information I can get from the subject matter experts is a minimum, maximum, and mode for the amount of fuel consumed per day, which is why I mentioned the Triangular distribution (although I suppose I could also use a Beta distribution).

Right now I generate a time series of the daily fuel consumption (180 days of daily consumption) by drawing a random variate from a Triangular(minimum, mode, maximum) distribution 180 times. However, generating a time series in this way does not, in general, result in the stretches of correlated activity that would be in the real time series. And I was trying to figure out a way to achieve this correlation, while still being able to say that the underlying distribution of the daily consumption is Triangular(minimum, mode, maximum).
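
(Sketching one standard construction for this, which is not from the thread: a Gaussian-copula / NORTA-style approach that keeps the Triangular(minimum, mode, maximum) marginal while adding lag-1 correlation via an AR(1) process. The function name and the rho value are assumptions.)
Python code:
import numpy as np
from scipy import stats

def correlated_triangular_series(n_days, minimum, mode, maximum, rho, seed=None):
    rng = np.random.default_rng(seed)
    # AR(1) process whose marginal stays standard normal at every step
    z = np.empty(n_days)
    z[0] = rng.standard_normal()
    for t in range(1, n_days):
        z[t] = rho * z[t - 1] + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    # Map correlated normals -> correlated uniforms -> triangular variates
    u = stats.norm.cdf(z)
    c = (mode - minimum) / (maximum - minimum)  # scipy's triangular shape parameter
    return stats.triang.ppf(u, c, loc=minimum, scale=maximum - minimum)

daily_fuel = correlated_triangular_series(180, 2, 4, 5, rho=0.8, seed=42)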

Jose Cuervo
Aug 25, 2004
I am trying to compute a kernel density estimate for the rate of incidents around the state of Virginia, similar to what is done here.

My previous attempt using 'euclidean' as the metric produced this KDE plot:


I then realised that I should be using 'haversine' as the metric because I have a two dimensional vector space as described here. I have had to modify the code from the Python Data Science Handbook example because I am using geopandas to plot the state map instead of the matplotlib basemap, and my data is in a dataframe and not numpy arrays like the data seems to be in the linked example.

Here is my code:
Python code:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KernelDensity

# Plot map of Virginia
fig, subplot = plt.subplots(figsize=(20, 10))
subplot.set_aspect('equal')
subplot.axis('off')
subplot.grid(False)    
va_gdf.plot(ax=subplot, facecolor='none', alpha=0.75, linewidth=1.5, edgecolor='#444444')
va_counties_gdf.plot(ax=subplot, facecolor='none', alpha=0.75, linewidth=1.5, edgecolor='#444444')

# https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html
# Perform the kernel density estimate
kde = KernelDensity(bandwidth=.03, metric='haversine')  # Haversine requires latitude, longitude order
latlon = np.radians(np.vstack([df['Latitude'], df['Longitude']]).T)
kde.fit(latlon)

minx, miny, maxx, maxy = va_gdf.total_bounds
X, Y = np.mgrid[minx:maxx:100j, miny:maxy:100j]
xy = np.radians(np.vstack([X.ravel(), Y.ravel()]).T)
Z = np.exp(kde.score_samples(xy))
Z = Z.reshape(X.shape)

subplot.scatter(x=df['Longitude'], y=df['Latitude'], s=10, alpha=0.5)
subplot.imshow(Z, cmap=cmap_name, extent=[minx, maxx, miny, maxy], alpha=0.5)
plt.tight_layout()
plt.show() 
and this is the result


There are about 8000 points plotted on the map with some well defined clusters of points. Unfortunately the KDE plot does not seem to be picking up on them as it does in the linked example. I have tried playing with the bandwidth for the KDE but this does not seem to change things for the better - for example when the bandwidth is set to 0.3 this is the resulting KDE plot:


I was expecting the KDE using 'haversine' as the metric to be slightly different to the original KDE plot, but still somewhat similar. I think that the KDE plots I have now are incorrect but I cannot tell what I am doing wrong. Thoughts?

Jose Cuervo
Aug 25, 2004

bob dobbs is dead posted:

euclidean distance metric is fine unless you're using the map for navigation tho?

2-space is within the set of n-spaces, yes

what do the latitude and longitude data points look like?

From the Python Data Science Handbook it seemed to say that you should use 'haversine' when performing KDE where the points are latitude and longitude - that is why I went from using 'euclidean' to 'haversine'. By 'fine' do you mean the error in the distances between points (because the distances will not be great circle distances but just straight line distances) is small enough to be ignored if you are not trying to navigate between points?

I am not sure what the '2-space is within the set of n-spaces, yes' comment means exactly.

The Longitude minimum and maximum are -83.6311 and -75.3771, while the Latitude maximum and minimum are 36.5454 and 39.4172. I believe they are in units of decimal degrees. Is this what you were asking?

Jose Cuervo
Aug 25, 2004

bob dobbs is dead posted:

what i thought that you were thinking is that "oh, you can't use euclidean at all for this", but the real statement to make is "euclidean will introduce distortions in this. but if you're overlaying it over a flat projection in the first place, the distortions will basically look like the map distortions". the other possible misunderstanding is just because euclidean distance is viable for any dimensional space (n-spaces) you think you can't just use it for 2-dimensional space (2-space)

also, the underlying mistake is prolly that you need it in radians

Thanks for the clarification regarding the 'euclidean' versus 'haversine' issue.

I am using radians though: in the code I posted I convert the latlon values to radians using numpy.radians() and I do the same for the sample points with the line
Python code:
xy = np.radians(np.vstack([X.ravel(), Y.ravel()]).T)
which is why I am so confused.

EDIT: But since it seems reasonable to use 'euclidean' I will just use that and not worry about why this is not working.
Edit2: Does posting on the forums count as duck debugging? I noticed that I had the sample points as (long, lat) pairs, not (lat, long) pairs, i.e. the line
Python code:
xy = np.radians(np.vstack([X.ravel(), Y.ravel()]).T)
should be
Python code:
xy = np.radians(np.vstack([Y.ravel(), X.ravel()]).T)
That was the issue.

Jose Cuervo fucked around with this message at 20:40 on Dec 7, 2018

Jose Cuervo
Aug 25, 2004
I am trying to parallelize some code I have. I have a dictionary with 5500 entries (but this number will grow to about 20000 entries later on), where each entry is a pandas series that is 288 rows in length. I have to run a function Y on each combination of entries in the dictionary, where Y computes the euclidean distance between the two series and is a fairly computationally inexpensive task. However, even with 5500 entries this comes out to approximately 15 million calls to function Y and I would like to parallelize it. I am looking into using joblib to make use of all 6 cores on my computer and I am wondering if

1. The best way to do this is to split the list of combinations into 6 approximately equal sublists, and then use joblib to run the computation for each list on a single core, and
2. Is it possible to have a single copy of the dictionary that can be accessed by each process, rather than having six copies of this large dictionary?
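
(A minimal joblib sketch under assumed names - series_dict for the dictionary of 288-row series and pairwise_distance for function Y; joblib handles splitting the combinations across workers, though each worker still receives copies of the series it is handed rather than sharing one dictionary:)
Python code:
from itertools import combinations

import numpy as np
from joblib import Parallel, delayed

def pairwise_distance(a, b):
    # Euclidean distance between two equal-length pandas Series
    return float(np.sqrt(((a.values - b.values) ** 2).sum()))

pairs = list(combinations(series_dict.keys(), 2))
distances = Parallel(n_jobs=6)(
    delayed(pairwise_distance)(series_dict[i], series_dict[j]) for i, j in pairs
)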

Jose Cuervo
Aug 25, 2004
I have a set of 4 Jupyter Notebooks which, when run one after the other, act like a pipeline (i.e., the output from the end of the first notebook is read in by the second notebook, and so on). Is there a way to write a python script to call each notebook in order, while waiting for each notebook to finish before starting the next?

EDIT: After some searching I just found papermill which it looks like will do exactly what I want.
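
(The basic papermill call, sketched with made-up notebook file names - each call blocks until that notebook finishes, so the notebooks run strictly in order:)
Python code:
import papermill as pm

notebooks = ['01_load.ipynb', '02_clean.ipynb', '03_model.ipynb', '04_report.ipynb']
for nb in notebooks:
    # execute_notebook blocks until the notebook has run to completion
    pm.execute_notebook(nb, nb.replace('.ipynb', '_out.ipynb'))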

Jose Cuervo fucked around with this message at 19:02 on Jun 1, 2020

Jose Cuervo
Aug 25, 2004
I have a list of timestamps which represent the number of seconds since January 1st 2008. I would like to convert these into a pandas series of date times (e.g., 2020-09-27 22:13:56).

Looking through stack overflow I have found this question which uses the datetime library to convert a timestamp in seconds to a "readable" timestamp. However the datetime.datetime.fromtimestamp() function seems to be defined as seconds since January 1st 1970.

Is converting my timestamp using the datetime.datetime.fromtimestamp() as is, and then adding on the number of days between Jan 1st 1970 and Jan 1st 2008 (via pandas.DateOffset(days=num_days) where num_days is the number of days between the two dates) the best way to go about this?

Jose Cuervo
Aug 25, 2004

Bad Munki posted:

Logically, it may be clearer to find the offset of your epoch from the standard epoch and add that number of seconds to the times you have, and then convert THAT to a standard datetime object, just because then you're not calculating a known-to-be-wrong datetime at any point, but tomato tomato.

Yep, that makes more sense. Will do that.

Jose Cuervo
Aug 25, 2004

QuarkJets posted:

Pandas can generate timestamps from a series of integers, using whatever epoch you want, interpreting them in whatever units you want (the default is nanoseconds)

Python code:
# The read-in array is "sec_from_epoch"
# Warning: I assume seconds from 00:00, but plenty of epochs use 12:00
timestamps = pd.to_datetime(sec_from_epoch, unit='s', origin="2008-01-01")
Is this what you mean? For instance passing in [10, 20] returns a DatetimeIndex(['2008-01-01 00:00:10', '2008-01-01 00:00:20'])

Ah, excellent. That is even simpler than what I was proposing to do. Thanks for pointing this out - I will change this part of my code from what I was doing.

Jose Cuervo
Aug 25, 2004
I want to build what I think is a pretty simple database using SQLite and SQLAlchemy.

The database is going to hold a subset of data from a different database where each patient already has a unique numeric identifier (Medical Record Number, MRN).

The database is intended to hold information relevant to someone undergoing dialysis. In addition to the usual demographic information the patient will also have
1. A diabetic status (which can change over time from No to Yes)
2. Lab values (these patients get labs run at least once a month)
3. Hospitalizations

Below is what I have started to code up. In particular, I want to know if
1. I have defined the primary key in the Patient class correctly (I want to use the MRN as the primary key because data being added to the database can be uniquely identified as belonging to a patient with that MRN),
2. I have defined the relationships correctly,
3. I have defined the Foreign Keys correctly, and
4. What I need to add so that when a particular patient is removed from the 'patients' table (see below), all their associated data from the diabetic and hospitalizations tables is also removed.

Python code:
from sqlalchemy import create_engine, Column, Integer, Float, \
                        String, Date, Boolean, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship


Base = declarative_base()

class Patient(Base):
    __tablename__ = 'patients'

    id = Column(Integer)
    MRN = Column(Integer, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)
    date_of_birth = Column(String) 

    diabetic = relationship("Diabetic", back_populates='patient')
    lab_values = relationship('LabValues', back_populates='patient')
    hospitalizations = relationship('Hospitalization', back_populates='patient')

    def __repr__(self):
        return f'<Patient({self.last_name}, {self.first_name}' \
               f' ({self.MRN}, {self.date_of_birth})>'
    
    
class Diabetic(Base):
    __tablename__ = 'diabetic'
    
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    diabetic = Column(Boolean)
    MRN = Column(Integer, ForeignKey('patients.MRN'))
    
    patient = relationship('Patient', back_populates='diabetic')   


class LabValues(Base):
    __tablename__ = 'lab_values'
    
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    name = Column(String)
    units = Column(String)
    value = Column(Float)
    MRN = Column(Integer, ForeignKey('patients.MRN'))
    
    patient = relationship('Patient', back_populates='lab_values')
    
    def __repr__(self):
        return f'<LabValue ({self.date}, {self.name}, {self.value} {self.units})>'
		
	
class Hospitalization(Base):
    __tablename__ = 'hospitalizations'
    
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    hospital_name = Column(String)
    admit_reason = Column(String)
    length_of_stay = Column(Integer)
    MRN = Column(Integer, ForeignKey('patients.MRN'))
    
    patient = relationship('Patient', back_populates='hospitalizations') 
EDIT: And if it was not clear this is my first time working with databases, so what I have here might be completely wrong.

Jose Cuervo
Aug 25, 2004
To be clear, this is a research data set which lives on a HIPAA-compliant server. It is currently a bunch of Excel files that were pulled by the IT team who run the health system databases. I would like to turn the Excel files into a database so that things are much more organized and I can learn about databases, which is why I was asking for help.


Da Mott Man posted:

It's been a while since I worked with sqlalchemy, but if I remember correctly your id columns should have autoincrement=True, and for speed of lookups you should have index=True on any column you would use to select records.
You might also want to cascade delete child data if the patient is deleted for some reason.
I would also set the date_of_birth column to a DateTime instead of a string; it would be easier to query for patients of a specific age or range of ages.

I have changed the column to DateTime, that was a typo.
Thanks for the keyword 'cascade' - that allowed me to look up what I wanted. I think the new definition of Patient with cascade='all, delete' will accomplish the desired behavior, right?

Python code:
class Patient(Base):
    __tablename__ = 'patients'

    MRN = Column(Integer, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)
    date_of_birth = Column(Date) 

    diabetic = relationship("Diabetic", back_populates='patient', cascade='all, delete')
    lab_values = relationship('LabValues', back_populates='patient', cascade='all, delete')
    hospitalizations = relationship('Hospitalization', back_populates='patient', cascade='all, delete')

    def __repr__(self):
        return f'<Patient({self.last_name}, {self.first_name}' \
               f' ({self.MRN}, {self.date_of_birth})>'

Hollow Talk posted:

I'm not a fan of pro-forma integer id columns. I see these columns often, and they are useless more often than not (auto-incrementing columns are only really useful if you want to check for holes in your sequence, i.e. for deleted data, and that can be solved differently as well).

Patient already has a primary key, so the id column is superfluous. For the other tables, keys depend on how often you expect to load new data. If this only gets new data once a day, just make the combination of MRN + date the table's primary key (i.e. a compound key) -- SQLAlchemy lets you set multiple columns as primary keys, and it just makes a combined key out of it. If you update more frequently than daily, I would change date to a timestamp, and use that. Realistically, depending on your analysis needs, you will probably end up joining and either selecting the full history for a single patient, or the current status for all patients (or a subset thereof), which this would cover either way. For LabValues, you might need MRN + date + name (assuming name is whatever was tested in the lab).

Essentially, I am of the opinion that primary keys should have meaning on their own, which sequences (i.e. what SQLAlchemy will most likely create under the hood if you have an auto-incrementing integer column and if the underlying database supports sequences) most often do not.

Edit: This is on top of what has already been said. Also, foreign keys look good to me.

As I elaborated above, I am trying to place the data stored in the Excel files into a database. I expect at some point in the future to have more data to add to the database. The data timestamps are only specific to the date (i.e., I only know that a lab value arrived on 2020/03/21, and not at a specific time on that date).

Would the following change to the Diabetic class be along the lines of what you are suggesting?

Python code:
class Diabetic(Base):
    __tablename__ = 'diabetic'
    
    MRN = Column(Integer, ForeignKey('patients.MRN'), primary_key=True)
    date = Column(Date, primary_key=True)
    diabetic = Column(Boolean)    
    
    patient = relationship('Patient', back_populates='diabetic') 
Can you explain what advantage there is to having the compound key?

M. Night Skymall posted:

You should probably not use the MRN as the identifier throughout your database. It's generally best practice with PHI to isolate out the MRN into a de-identification table and use an identifier unique to your application to identify the patient. It's easy enough to do a join or whatever if someone wants to look people up by MRN, but it's useful to be able to display some kind of unique identifier that isn't immediately under all the HIPAA restrictions as PHI. Even if it's as simple as someone trying to do a bug report and not having to deal with the fact that their bug report must contain PHI to tell you which patient caused it. Not vomiting out PHI 100% of the time in error messages, things like that. Just make some other ID in the patient table and use that as the foreign key elsewhere.

Would you have this same concern given the use case (a research database where the only people with access to it are on the IRB protocol - currently just me for now)?

Jose Cuervo
Aug 25, 2004

M. Night Skymall posted:

It doesn't matter what the data is used for. The history of MRN as PHI is pretty dumb in my opinion, but per guidance from the government your MRN is as much PHI as your DOB or name. Having a de-identification table isn't a big deal, you can still store all your PHI in the patients table along with your new unique identifier. It's really *just* to remove the awkwardness of having everything in your DB keyed to a piece of PHI. I mean you're right, it's just you and it probably won't affect much now. But making good decisions about your schema is much, much easier now than it is later, and there's basically no way you will live to regret de-identifying your data in advance, and many ways you can live to regret spreading the MRN all over your database.

Anyone who's looking at this going "No no not python!" is just forcing this data to be sucked into a PowerBI tool instead. Don't make this poor person use PowerBI, that's just mean.

Edit: MRN is problematic because it's specifically listed as an identifier. If you have an internal ID in use for patients besides MRN that'd also be better than using the MRN directly. One more dumb thing about MRNs, they're specific to the hospital EMR so they aren't guaranteed to be unique.

I am on board with doing this, but I am having trouble wrapping my head around how I would accomplish this. Do I define a new column, say pid as the primary_key for the Patient model as below, and use pid as the foreign_key in the other models as below for the Diabetic model?

Python code:
class Patient(Base):
    __tablename__ = 'patients'

    pid = Column(Integer, primary_key=True)
    mrn = Column(Integer)
    first_name = Column(String)
    last_name = Column(String)
    date_of_birth = Column(Date) 

    diabetic = relationship('Diabetic', 
                            back_populates='patient', 
                            cascade='delete, all')


class Diabetic(Base):
    __tablename__ = 'diabetic'
    
    pid = Column(Integer, ForeignKey('patients.pid'), primary_key=True)
    date = Column(Date, primary_key=True)
    diabetic = Column(Boolean)  
    
    patient = relationship('Patient', back_populates='diabetic')   
However, if I do this, then I need to keep a mapping between the MRN and the pid outside of the database, right? So now I have the MRN in two different places...


Jose Cuervo
Aug 25, 2004

Hollow Talk posted:

Yes, that would be it.

Essentially, the reason why I dislike auto-increment integer columns is that they serve no particular purpose, and they provide no guarantees -- if you want to enforce constraints (i.e. to avoid duplicate data if you run your tool against the Excel files more than once), auto-incrementing columns don't help, because the only constraint for the database is that the primary key is unique (which a sequence ensures), plus any foreign key relations if you specify them. If you insert the same data again, the original option will happily do so, given this data:

Python code:
import datetime

from sqlalchemy import insert

# engine is assumed to be an existing SQLAlchemy Engine
with engine.begin() as connection:
    _patient = {
        "MRN": 1,
        "first_name": "Hollow",
        "last_name": "Talk",
        "date_of_birth": "1970-01-01",
    }
    _diabetic = [
        {"MRN": 1, "date": datetime.date(2021, 3, 23), "diabetic": False},
        {"MRN": 1, "date": datetime.date(2021, 3, 23), "diabetic": False},
    ]

    connection.execute(insert(Patient).values(_patient))
    connection.execute(insert(Diabetic).values(_diabetic))
If you define an auto-increment column, this will insert even though the diabetic data row is duplicated. If you instead define your primary keys according to your actual uniqueness logic, python will throw an exception for the above duplicate:

Bash code:
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: diabetic.MRN, diabetic.date
[SQL: INSERT INTO diabetic ("MRN", date, diabetic) VALUES (?, ?, ?), (?, ?, ?)]
[parameters: (1, '2021-03-23', 0, 1, '2021-03-23', 0)]
(Background on this error at: [url]http://sqlalche.me/e/14/gkpj[/url])
In most cases*, I prefer my inserts to fail as early as possible when constraints are violated, so that I can fix my insert/select logic, rather than clean up duplicate and/or otherwise broken data afterwards. This lets you express what can and cannot be inserted more than once, and you don't add any technical columns that don't really serve any analytical purpose (nor any other purpose, really). For this use case, using the auto-increment column as primary key and not using any primary key are functionally identical.

Unrelated: If you can avoid it, try to stick to lowercase column names, so use mrn instead of MRN. Many databases don't care, but for those cases where they do, this saves you from lots of quoting (or general incompatibility).

*: This does not apply for data warehousing, where constraints might not be feasible. Or if you use something like Amazon Redshift, because "we'll just ignore all constraints because that will make things faster" is certainly one way to speed up inserts.

Got it. The case of trying to insert data which already exists in the database is something I was thinking about, and the way you suggested defining the compound key neatly solves that issue.

As a follow-on question, would you wrap the insert statement in a try-except block, and catch the "sqlalchemy.exc.IntegrityError" exception to deal with not crashing when trying to insert data which is already in the database?
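
(i.e., something along these lines, reusing the engine/insert setup from the quoted example - a sketch of the pattern being asked about, not a recommendation from the thread:)
Python code:
from sqlalchemy import insert
from sqlalchemy.exc import IntegrityError

try:
    with engine.begin() as connection:
        connection.execute(insert(Diabetic).values(_diabetic))
except IntegrityError:
    # The (MRN, date) key already exists; skip, log, or update as appropriate
    pass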
