|
I am trying to understand OrderedDicts. I am reading the documentation and am having trouble understanding what the following line is saying: "The OrderedDict constructor and update() method both accept keyword arguments, but their order is lost because Python’s function call semantics pass in keyword arguments using a regular unordered dictionary." Is this saying that if I add items to the OrderedDict as follows: Python code:
code:
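The original snippet didn't survive, but a minimal sketch of the distinction the docs are describing (note the quoted caveat is from the Python 2 / early-3.x docs; since Python 3.7, keyword-argument order is guaranteed to be preserved):

```python
from collections import OrderedDict

# Passing key/value pairs as a sequence always preserves insertion order.
d = OrderedDict([('b', 2), ('a', 1), ('c', 3)])
print(list(d.keys()))  # ['b', 'a', 'c']

# Under Python 2 (the docs being quoted), keyword arguments arrived in a
# plain unordered dict, so OrderedDict(b=2, a=1, c=3) could come out in any
# order. Since Python 3.7, **kwargs preserve order, so this form is now safe.
```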
|
# ¿ Mar 14, 2017 21:01 |
|
Got it, thanks!
|
# ¿ Mar 14, 2017 21:29 |
|
The Gunslinger posted:I just uninstalled Python and installed Anaconda instead. Having the same problem though. I think you have a file named turtle.py in the same folder as the file bob.py. So when it looks to find the turtle module, it "finds" it where it first looks - in the same folder, and never gets to look in the folder where Python is installed.
|
# ¿ Mar 21, 2017 13:34 |
|
I am using Python 2.7 with PyCharm Community Edition 2016.3.2. I have the following snippet of code: Python code:
|
# ¿ Mar 22, 2017 15:29 |
|
QuarkJets posted:Try using a comma instead of a colon in the typehinting Forgot to say that this is what the problem was. Thanks!
|
# ¿ Mar 27, 2017 15:06 |
|
Eela6 posted:Why not? I'd love to know Me too.
|
# ¿ Mar 30, 2017 13:44 |
|
FingersMaloy posted:I'm still trying to make this scraper work. I've abandoned BeautifulSoup and totally committed to Scrapy, my spider works but I can't make it pull the exact pieces I need. I'm using this code as my guide, but it's not fully working: The error message is telling you that you are trying to concatenate a string and a list - apparently item['link'] is a list. I think your error is your for statement ("for titleS in titles:"), where I think you actually want to say "for title in titleS:", and then change occurrences of titleS in the for loop to title.
|
# ¿ Apr 3, 2017 20:49 |
|
I have read a large amount of data into a Pandas DataFrame from multiple .csv files (where each .csv file contained the same data but for different years). There are columns corresponding to test scores in the DataFrame which should all be numbers; however, based on a df[col_name].unique() call it is clear that some files stored the values in these columns as numbers, some as strings, and sometimes, when there wasn't a test score, text was used ('No Score'). Because of this, I cannot use the .astype() functionality to convert the column to floats. What would be a good way to go about identifying the entries that are stored as strings and converting them to floats, while at the same time identifying the entries that are text and replacing those entries with nan?
|
# ¿ Apr 28, 2017 03:23 |
|
vikingstrike posted:
Or even simpler: code:
|
# ¿ Aug 15, 2017 03:34 |
|
I wrote a small data processing script a while back and since then I have updated pandas. Now the script does not work. Short of creating a venv for a single script, is there any way of telling my system to use a specific version of pandas when running the script?
|
# ¿ Nov 20, 2017 16:59 |
|
My use case is that I need to rerun a bit of analysis for a paper in response to a review that requires something slightly different or with an updated set of data. I don't think it is worth my time to fix the script that I might have written 6 months ago for a one-time run, so it seems like the venv is the way to go.
|
# ¿ Nov 21, 2017 15:20 |
|
Seventh Arrow posted:What I'm trying to say is that each folium line can't keep reading the first row over and over again. The first folium line needs to use row 1, the second one needs to use row 2, and so on. If your dataframe has 'Lat', 'Long' and 'Description' columns, then I think this is what you might be looking for: Python code:
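The code itself was lost, but a sketch of the iterrows() pattern being suggested (the folium call is shown as a comment so the example stays self-contained; column names are taken from the post):

```python
import pandas as pd

df = pd.DataFrame({
    'Lat': [43.65, 43.66, 43.67],
    'Long': [-79.38, -79.39, -79.40],
    'Description': ['stop A', 'stop B', 'stop C'],
})

markers = []
# iterrows() yields (index, row) pairs, so each folium call sees the next
# row rather than re-reading the first one.
for idx, row in df.iterrows():
    position = (row['Lat'], row['Long'])
    markers.append((position, row['Description']))
    # With folium this would be:
    # folium.Marker(position, popup=row['Description']).add_to(map_1)

print(markers[0])  # ((43.65, -79.38), 'stop A')
```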
|
# ¿ Jan 10, 2018 16:01 |
|
Fair enough - but how would you vectorize what Seventh Arrow wants to do? Edit: Or are you saying use something like: Python code:
Jose Cuervo fucked around with this message at 17:45 on Jan 10, 2018 |
# ¿ Jan 10, 2018 17:41 |
|
Cingulate posted:If the thing you need to do is indeed to go through the data line by line, and for each line, run the Marker thing on that line's values, then you could indeed do what I'm suggesting here: I don't know that the code you have works. Running the following (I think equivalent) code results in 'ValueError: need more than 1 value to unpack' Python code:
|
# ¿ Jan 10, 2018 19:16 |
|
Seventh Arrow posted:Ok I think I'm almost there. I've managed to twist its arm enough that it will post the map marker with a colour...once. So I think I'm in the ballpark, but I get a map with only one marker. Here's what I have so far: You need to place the line folium.Marker(position, icon=folium.Icon(color=cl)).add_to(map_1) inside the for loop (so basically tab it over once). Right now the only thing being added to the map is the last location. EDIT: Also it looks like you chose to use the iterrows() solution that I proposed, but as pointed out that is probably the slowest way to do things, and you should probably use one of the other suggested methods. Jose Cuervo fucked around with this message at 02:20 on Jan 24, 2018 |
# ¿ Jan 24, 2018 02:17 |
|
Wallet posted:I have another potentially stupid question that I can hopefully explain in something resembling a comprehensible fashion: You could generate a single list of all the numbers, then iterate through the original lines, and for each word on each of the original lines, pop a number from the single list to attach to the end of the word. Edit: Something along these lines: Python code:
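A sketch along those lines (the input data is invented for illustration):

```python
lines = ['alpha beta', 'gamma', 'delta epsilon']
numbers = [1, 2, 3, 4, 5]          # one number per word, generated up front

out = []
for line in lines:
    # pop(0) consumes numbers from the front of the single shared list,
    # so the numbering continues across lines instead of restarting.
    out.append(' '.join(f'{word}{numbers.pop(0)}' for word in line.split()))

print(out)  # ['alpha1 beta2', 'gamma3', 'delta4 epsilon5']
```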
Jose Cuervo fucked around with this message at 15:01 on Jan 26, 2018 |
# ¿ Jan 26, 2018 14:57 |
|
I am trying to understand what I am doing wrong with a Pandas merge of two data frames: Python code:
|
# ¿ Feb 3, 2018 04:09 |
|
vikingstrike posted:Is df_two unique on ID, Date? Are you asking if all the (ID, Date) tuple combinations in df_two are unique? If so, yes. df_two was generated using groupby where on=['ID', 'Date'].
|
# ¿ Feb 3, 2018 15:52 |
|
vikingstrike posted:Er, sorry. Unique was a bad word. In df_two does (ID, Date) ever include multiple rows? No, because df_two was generated using a groupby statement where on=['ID', 'Date'] - so every row in df_two corresponds to a unique ('ID', 'Date') tuple.
|
# ¿ Feb 3, 2018 16:49 |
|
vikingstrike posted:Try the indicator flag on the merge and then use it to see if it might lead you to where the extra rows are coming. Just wanted to say you got me thinking about having duplicates, and df_one actually had non-unique ('ID', 'Date') tuples, so that is where the problem was. Thanks for the help.
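For reference, a sketch of diagnosing this with `indicator=True` and `duplicated()` (toy frames assumed):

```python
import pandas as pd

df_one = pd.DataFrame({'ID': [1, 1, 2], 'Date': ['d1', 'd1', 'd2'],
                       'x': [10, 11, 12]})        # note the duplicate (1, 'd1')
df_two = pd.DataFrame({'ID': [1, 2], 'Date': ['d1', 'd2'],
                       'y': [100, 200]})          # unique on (ID, Date)

# Duplicates on the merge keys multiply rows in the result.
print(df_one.duplicated(subset=['ID', 'Date']).any())  # True

merged = df_one.merge(df_two, on=['ID', 'Date'], how='left', indicator=True)
# The _merge column shows where each output row came from
# ('both', 'left_only', 'right_only'), which helps track down extra rows.
print(merged['_merge'].value_counts())
print(len(merged))  # 3 - one output row per df_one row, duplicates included
```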
|
# ¿ Feb 5, 2018 20:20 |
|
Here is my folder structure: code:
Python code:
|
# ¿ Mar 14, 2018 19:08 |
|
Dr Subterfuge posted:You'll want to put an __init__.py everywhere so you can import your folders as packages Thanks. Am I correct in thinking that with the following structure and __init__.py placement: code:
Python code:
Python code:
But with the following structure and __init__.py placement: code:
Python code:
|
# ¿ Mar 14, 2018 21:32 |
|
Boris Galerkin posted:Helpful explanation. Got it. Dr Subterfuge's suggestion of adding the scripts folder to the path does make what I posted work, and seems much simpler than having to rename the scripts folder.
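A hedged sketch of that path approach; here the scripts folder and module are created on the fly so the example is self-contained, but in practice the folder already exists in the project:

```python
import sys
import tempfile
from pathlib import Path

# Stand-in for an existing layout: project/scripts/helper.py
project = Path(tempfile.mkdtemp())
scripts = project / 'scripts'
scripts.mkdir()
(scripts / 'helper.py').write_text('def greet():\n    return "hello"\n')

# Appending the scripts folder to sys.path lets its modules be imported
# directly, without any __init__.py files or folder renaming.
sys.path.append(str(scripts))
import helper

print(helper.greet())  # hello
```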
|
# ¿ Mar 15, 2018 19:27 |
|
I have a Jupyter notebook where I have a set of 3 subplots which share an x-axis. The x-axis is time in weeks, and I would like to make the plot scrollable (ideally) with two buttons at the bottom - a forward button which increases the upper and lower bound on the x-axis by 4 weeks, and a backward button which decreases the upper and lower bound on the x-axis by 4 weeks. This is the code I use right now to generate a static image in the notebook cell: Python code:
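The plotting code itself was lost, but the button logic can be sketched separately; the step function below is what a forward/backward `ipywidgets.Button` callback would call to move the shared x-limits (the widget wiring is shown as comments, since it needs a live notebook; the dates are invented):

```python
from datetime import date, timedelta

STEP = timedelta(weeks=4)

def shift_window(lo, hi, direction):
    """Move both x-axis bounds by 4 weeks; direction is +1 (forward) or -1 (back)."""
    return lo + direction * STEP, hi + direction * STEP

lo, hi = date(2018, 1, 1), date(2018, 3, 26)
lo, hi = shift_window(lo, hi, +1)
print(lo, hi)  # 2018-01-29 2018-04-23

# In the notebook, a forward ipywidgets.Button's on_click handler would do:
#   lo, hi = shift_window(lo, hi, +1)
#   for ax in axes: ax.set_xlim(lo, hi)
#   fig.canvas.draw_idle()
```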
|
# ¿ Jul 10, 2018 19:07 |
|
I taught a class for a colleague last week that used Jupyter notebooks. Each student was told to download the notebook from the class website, and then we went through the notebook in class and the students had cells where they had to type their own code. One of the issues I encountered when plotting using seaborn was that not everyone had the exact same plot - i.e., the scatter plot matrix looked slightly different between students (the underlying shape etc was correct, but the presentation was different). I think this came down to the fact that not everyone was using Python 3 like I was, and not everyone had the same version of seaborn installed. Another issue was pandas.cut() worked slightly differently for everyone because of changes between versions. Question: Is there a standard / simple way in the Notebook to ensure that everyone uses the same version of Python, and the same version of the packages being imported?
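Not a substitute for a managed environment, but a stdlib-only first cell can at least fail fast when versions diverge (the package names and pins here are made up for illustration):

```python
import sys
from importlib.metadata import version, PackageNotFoundError  # Python 3.8+

def installed_version(pkg):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

assert sys.version_info >= (3, 8), 'This notebook needs Python 3.8+'
for pkg, wanted in [('seaborn', '0.9.0'), ('pandas', '0.23.4')]:
    have = installed_version(pkg)
    if have != wanted:
        print(f'warning: {pkg} is {have}, notebook was written against {wanted}')
```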
|
# ¿ Sep 26, 2018 21:01 |
|
Symbolic Butt posted:Installing Anaconda seems to be the best solution for this in my experience. What does this process look like? Get everyone in the class to download Anaconda on day one of the class, then...? OnceIWasAnOstrich posted:Run a JupyterHub server and set everyone to use one consistent environment? The documentation for this looks like it is tailor made for this, thanks.
|
# ¿ Sep 27, 2018 16:51 |
|
Not sure if this is the right place to ask this, but I want to generate a time series which simulates consumption of a product. One way of doing this is to assume that the consumption has a particular distribution, say Triangular(lower=2, mode=4, upper=5), and then at each time step draw a random variate from that distribution to simulate the amount consumed during that time step. However, generating the time series in this way does not produce any correlation in consumption between successive time steps. That is, if the consumption of the product was on the higher end of the distribution at time t, then the consumption of the product at time t+1 should likely be on the higher end of the distribution as well, and vice versa. How would I go about adding the correlation aspect to the simulated time series?
|
# ¿ Oct 26, 2018 20:34 |
|
CarForumPoster posted:If you’re generating based on random variables (0,1) that follow a triangular CDF, you could make an if/then that looks at the random variable from t-1 and keeps generating numbers until the random number is within x distance/percentage/etc of the previous Morek posted:What is the usage scale you're actually trying to simulate? Hundreds of use sessions, millions, billions? Spime Wrangler posted:You don't really give enough info to specify the problem. Does the triangle distribution describe instantaneous total consumption at time t, or does it describe inter-arrival times of consumers, or per-instance time-required-to-consume? Sorry for not providing enough information. The project involves simulating the amount of fuel consumed at a forward operating base each day. During periods of intense activity there are several or more days of high daily consumption, while during lull periods there are stretches of several days of low daily consumption. Historical consumption information is classified, so I have to generate a consumption time series on my own. The information I can get from the subject matter experts is a minimum, maximum, and mode for the amount of fuel consumed per day, which is why I mentioned the Triangular distribution (although I suppose I could also use a Beta distribution). Right now I generate a time series of the daily fuel consumption (180 days of daily consumption) by drawing a random variate from a Triangular(minimum, mode, maximum) distribution 180 times. However, generating a time series in this way does not, in general, result in the stretches of correlated activity that would be in the real time series. And I was trying to figure out a way to achieve this correlation, while still being able to say that the underlying distribution of the daily consumption is Triangular(minimum, mode, maximum).
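One standard trick for exactly this (a Gaussian-copula / AR(1) transform, not something suggested in the thread): generate an autocorrelated standard normal series, map it through the normal CDF to uniforms, then through the inverse triangular CDF. Each day's marginal is still Triangular(min, mode, max), but successive days are correlated:

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def triangular_inv_cdf(u, lo, mode, hi):
    """Inverse CDF of Triangular(lo, mode, hi) for u in (0, 1)."""
    c = (mode - lo) / (hi - lo)
    if u < c:
        return lo + math.sqrt(u * (hi - lo) * (mode - lo))
    return hi - math.sqrt((1.0 - u) * (hi - lo) * (hi - mode))

def correlated_series(n, lo, mode, hi, rho, rng):
    z = rng.gauss(0.0, 1.0)
    out = []
    for _ in range(n):
        # AR(1) update keeps z marginally N(0, 1) while its lag-1
        # autocorrelation is rho, giving the stretches of high/low days.
        z = rho * z + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        out.append(triangular_inv_cdf(norm_cdf(z), lo, mode, hi))
    return out

rng = random.Random(42)
daily = correlated_series(180, 2.0, 4.0, 5.0, rho=0.8, rng=rng)
print(min(daily) >= 2.0 and max(daily) <= 5.0)  # True
```

The achieved lag-1 correlation of the triangular series comes out slightly below rho because the transform is nonlinear, so rho would need to be tuned to whatever persistence the subject matter experts describe.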
|
# ¿ Oct 29, 2018 14:36 |
|
I am trying to compute a kernel density estimate for the rate of incidents around the state of Virginia, similar to what is done here. My previous attempts with using 'euclidean' as the metric produce this KDE plot: I then realised that I should be using 'haversine' as the metric because I have a two-dimensional vector space as described here. I have had to modify the code from the Python Data Science Handbook example because I am using geopandas to plot the state map instead of the matplotlib basemap, and my data is in a dataframe and not numpy arrays like the data seems to be in the linked example. Here is my code: Python code:
There are about 8000 points plotted on the map with some well defined clusters of points. Unfortunately the KDE plot does not seem to be picking up on them as it does in the linked example. I have tried playing with the bandwidth for the KDE but this does not seem to change things for the better - for example when the bandwidth is set to 0.3 this is the resulting KDE plot: I was expecting the KDE using 'haversine' as the metric to be slightly different to the original KDE plot, but still somewhat similar. I think that the KDE plots I have now are incorrect but I cannot tell what I am doing wrong. Thoughts?
|
# ¿ Dec 7, 2018 19:54 |
|
bob dobbs is dead posted:euclidean distance metric is fine unless you're using the map for navigation tho? From the Python Data Science Handbook it seemed to say that you should use 'haversine' when performing KDE where the points are latitude and longitude - that is why I went from using 'euclidean' to 'haversine'. By 'fine' do you mean the error in the distances between points (because the distances will not be great circle distances but just straight line distances) is small enough to be ignored if you are not trying to navigate between points? I am not sure what the '2-space is within the set of n-spaces, yes' comment means exactly. The Longitude minimum and maximum are -83.6311 and -75.3771, while the Latitude maximum and minimum are 36.5454 and 39.4172. I believe they are in units of decimal degrees. Is this what you were asking?
|
# ¿ Dec 7, 2018 20:26 |
|
bob dobbs is dead posted:what i thought that you were thinking is that "oh, you can't use euclidean at all for this", but the real statement to make is "euclidean will introduce distortions in this. but if you're overlaying it over a flat projection in the first place, the distortions will basically look like the map distortions". the other possible misunderstanding is just because euclidean distance is viable for any dimensional space (n-spaces) you think you can't just use it for 2-dimensional space (2-space) Thanks for the clarification regarding the 'euclidean' versus 'haversine' issue. I am using radians though: in the code I posted I convert the latlon values to radians using numpy.radians() and I do the same for the sample points with the line Python code:
EDIT: But since it seems reasonable to use 'euclidean' I will just use that and not worry about why this is not working. Edit2: Does posting on the forums count as duck debugging? I noticed that I had the sample points as (long, lat) pairs, not (lat, long) pairs, i.e. the line Python code:
Python code:
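The swapped-pair bug is easy to see with a stdlib haversine sketch: the convention is (lat, lon) in radians, latitude first, and feeding (lon, lat) silently gives a different answer (the city coordinates below are approximate and just for illustration):

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance; all four arguments in radians, latitude first."""
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate (lat, lon) for Richmond and Norfolk, VA, converted to radians.
richmond = tuple(map(math.radians, (37.54, -77.44)))
norfolk = tuple(map(math.radians, (36.85, -76.29)))

right = haversine_km(richmond[0], richmond[1], norfolk[0], norfolk[1])
# Feeding (lon, lat) instead of (lat, lon) produces a different, wrong number.
wrong = haversine_km(richmond[1], richmond[0], norfolk[1], norfolk[0])
print(round(right), round(wrong))  # the two disagree
```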
Jose Cuervo fucked around with this message at 20:40 on Dec 7, 2018 |
# ¿ Dec 7, 2018 20:34 |
|
I am trying to parallelize some code I have. I have a dictionary with 5500 entries (but this number will grow to about 20000 entries later on), where each entry is a pandas series that is 288 rows in length. I have to run a function Y on each combination of entries in the dictionary, where Y computes the euclidean distance between the two series and is a fairly computationally inexpensive task. However, even with 5500 entries this comes out to approximately 15 million calls to function Y and I would like to parallelize it. I am looking into using joblib to make use of all 6 cores on my computer and I am wondering if 1. The best way to do this is to split the list of combinations into 6 approximately equal sublists, and then use joblib to run the computation for each list on a single core, and 2. Is it possible to have a single copy of the dictionary that can be accessed by each process, rather than having six copies of this large dictionary?
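A sketch of point 1: materialize the pair list once with `itertools.combinations` and slice it into one chunk per core (the `joblib.Parallel` call is shown as a comment since joblib isn't assumed here). On point 2, with `fork`-based workers on Linux the child processes share the parent's dictionary pages copy-on-write, so read-only access does not actually duplicate it six times:

```python
from itertools import combinations

def chunked_pairs(keys, n_chunks):
    """Split all unordered key pairs into n_chunks roughly equal sublists."""
    pairs = list(combinations(keys, 2))
    size = -(-len(pairs) // n_chunks)  # ceiling division
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]

series_dict = {f's{i}': None for i in range(10)}   # stand-in for the real dict
chunks = chunked_pairs(series_dict.keys(), 6)

print(sum(len(c) for c in chunks))  # 45 == C(10, 2)

# With joblib this would then be roughly:
# results = joblib.Parallel(n_jobs=6)(
#     joblib.delayed(process_chunk)(series_dict, chunk) for chunk in chunks)
```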
|
# ¿ Apr 9, 2020 21:54 |
|
I have a set of 4 Jupyter Notebooks which, when run one after the other, act like a pipeline (i.e., the output from the end of the first notebook is read in by the second notebook, and so on). Is there a way to write a python script to call each notebook in order, while waiting for each notebook to finish before starting the next? EDIT: After some searching I just found papermill which it looks like will do exactly what I want. Jose Cuervo fucked around with this message at 19:02 on Jun 1, 2020 |
# ¿ Jun 1, 2020 18:40 |
|
I have a list of timestamps which represent the number of seconds since January 1st 2008. I would like to convert these into a pandas series of date times (e.g., 2020-09-27 22:13:56). Looking through stack overflow I have found this question which uses the datetime library to convert a timestamp in seconds to a "readable" timestamp. However the datetime.datetime.fromtimestamp() function seems to be defined as seconds since January 1st 1970. Is converting my timestamp using the datetime.datetime.fromtimestamp() as is, and then adding on the number of days between Jan 1st 1970 and Jan 1st 2008 (via pandas.DateOffset(days=num_days) where num_days is the number of days between the two dates) the best way to go about this?
|
# ¿ Sep 28, 2020 17:16 |
|
Bad Munki posted:Logically, it may be clearer to find the offset of your epoch from the standard epoch and add that number of seconds to the times you have, and then convert THAT to a standard datetime object, just because then you're not calculating a known-to-be-wrong datetime at any point, but tomato tomato. Yep, that makes more sense. Will do that.
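A minimal stdlib sketch of that approach: anchor a datetime at the custom epoch and add the seconds directly, so no known-to-be-wrong intermediate date is ever computed:

```python
from datetime import datetime, timedelta, timezone

CUSTOM_EPOCH = datetime(2008, 1, 1, tzinfo=timezone.utc)

def from_custom_epoch(seconds):
    """Convert seconds-since-2008-01-01 to a timezone-aware datetime."""
    return CUSTOM_EPOCH + timedelta(seconds=seconds)

print(from_custom_epoch(0))       # 2008-01-01 00:00:00+00:00
print(from_custom_epoch(86_461))  # 2008-01-02 00:01:01+00:00
```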
|
# ¿ Sep 28, 2020 18:10 |
|
QuarkJets posted:Pandas can generate timestamps from a series of integers, using whatever epoch you want, interpreting them in whatever units you want (the default is nanoseconds) Ah, excellent. That is even simpler than what I was proposing to do. Thanks for pointing this out - I will change this part of my code from what I was doing.
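For reference, the pandas one-liner being described is `to_datetime` with the `unit` and `origin` parameters (available since pandas 0.20):

```python
import pandas as pd

seconds = pd.Series([0, 86_400, 401_414_036])
# unit='s' interprets the integers as seconds; origin shifts the epoch.
stamps = pd.to_datetime(seconds, unit='s', origin=pd.Timestamp('2008-01-01'))
print(stamps.iloc[0])  # 2008-01-01 00:00:00
print(stamps.iloc[1])  # 2008-01-02 00:00:00
```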
|
# ¿ Oct 1, 2020 14:51 |
|
I want to build what I think is a pretty simple database using SQLite and SQLAlchemy. The database is going to hold a subset of data from a different database where each patient already has a unique numeric identifier (Medical Record Number, MRN). The database is intended to hold information relevant to someone undergoing dialysis. In addition to the usual demographic information the patient will also have 1. A diabetic status (which can change over time from No to Yes) 2. Lab values (these patients get labs run at least once a month) 3. Hospitalizations Below is what I have started to code up. In particular, I want to know if 1. I have defined the primary key in the Patient class correctly (I want to use the MRN as the primary key because data being added to the database can be uniquely identified as belonging to a patient with that MRN), 2. I have defined the relationships correctly, 3. I have defined the Foreign Keys correctly, and 4. What I need to add so that when a particular patient is removed from the 'patients' table (see below), all their associated data from the diabetic and hospitalizations tables is also removed. Python code:
|
# ¿ Mar 23, 2021 03:18 |
|
To be clear, this is a research data set which lives on a HIPAA compliant server. It is currently a bunch of excel files that were pulled by the IT team who run the health system databases. I would like to turn the excel files into a database so that things are much more organized and I can learn about databases. Hence why I was asking for help. Da Mott Man posted:It's been a while working with sqlalchemy but if I remember correctly your id columns should have autoincrement=True and for speed of lookups you should have an index=True for any column you would use to select records. I have changed the column to DateTime, that was a typo. Thanks for the keyword 'cascade' - that allowed me to look up what I wanted. I think the new definition of Patient with cascade='all, delete' will accomplish the desired behavior, right? Python code:
Hollow Talk posted:I'm not a fan of pro-forma integer id columns. I see these columns often, and they are useless more often than not (auto-incrementing columns are only really useful if you want to check for holes in your sequence, i.e. for deleted data, and that can be solved differently as well). As I elaborated above I am trying place the data stored in Excel files into a database. I expect at some point in the future to have more data to add to the database. The data timestamps are only specific to the date (i.e., I only know that a lab value arrived on 2020/03/21, and not at a specific time on that date). Would the following change to the Diabetic class be along the lines of what you are suggesting? Python code:
M. Night Skymall posted:You should probably not use the MRN as the identifier throughout your database. It's generally best practice with PHI to isolate out the MRN into a de-identification table and use an identifier unique to your application to identify the patient. It's easy enough to do a join or whatever if someone wants to look people up by MRN, but it's useful to be able to display some kind of unique identifier that isn't immediately under all the HIPAA restrictions as PHI. Even if it's as simple as someone trying to do a bug report and not having to deal with the fact that their bug report must contain PHI to tell you which patient caused it. Not vomiting out PHI 100% of the time in error messages, things like that. Just make some other ID in the patient table and use that as the foreign key elsewhere. Would you have this same concern given the use case (a research database where the only people with access to it are on the IRB protocol - currently just me for now)?
|
# ¿ Mar 23, 2021 17:46 |
|
M. Night Skymall posted:It doesn't matter what the data is used for. The history of MRN as PHI is pretty dumb in my opinion, but per guidance from the government your MRN is as much PHI as your DOB or name. Having a de-identification table isn't a big deal, you can still store all your PHI in the patients table along with your new unique identifier. It's really *just* to remove the awkwardness of having everything in your DB keyed to a piece of PHI. I mean you're right, it's just you and it probably won't affect much now. But making good decisions about your schema is much, much easier now than it is later, and there's basically no way you will live to regret de-identifying your data in advance, and many ways you can live to regret spreading the MRN all over your database. I am on board with doing this, but I am having trouble wrapping my head around how I would accomplish this. Do I define a new column, say pid as the primary_key for the Patient model as below, and use pid as the foreign_key in the other models as below for the Diabetic model? Python code:
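The schema idea is easiest to see in raw SQL first; this sketch uses plain sqlite3 so it runs standalone. In the SQLAlchemy models this would correspond to `pid = Column(Integer, primary_key=True)` on Patient, `mrn` demoted to an ordinary unique column, and `ForeignKey('patients.pid')` in every child model (table and column names below are invented):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('PRAGMA foreign_keys = ON')  # SQLite needs this for cascades

con.execute('''CREATE TABLE patients (
    pid INTEGER PRIMARY KEY,         -- surrogate key used everywhere else
    mrn TEXT NOT NULL UNIQUE         -- PHI, confined to this one table
)''')
con.execute('''CREATE TABLE diabetic (
    pid INTEGER NOT NULL REFERENCES patients(pid) ON DELETE CASCADE,
    status_date TEXT NOT NULL,
    is_diabetic INTEGER NOT NULL,
    PRIMARY KEY (pid, status_date)
)''')

con.execute("INSERT INTO patients (mrn) VALUES ('12345678')")
pid = con.execute("SELECT pid FROM patients WHERE mrn = '12345678'").fetchone()[0]
con.execute('INSERT INTO diabetic VALUES (?, ?, ?)', (pid, '2020-03-21', 1))

# Deleting the patient cascades to the child table.
con.execute('DELETE FROM patients WHERE pid = ?', (pid,))
print(con.execute('SELECT COUNT(*) FROM diabetic').fetchone()[0])  # 0
```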
|
# ¿ Mar 23, 2021 20:03 |
|
Hollow Talk posted:Yes, that would be it. Got it. The case of trying to insert data which already exists in the database is something I was thinking about, and the way you suggested defining the compound key neatly solves that issue. As a follow-on question, would you wrap the insert statement in a try-except block, and catch the "sqlalchemy.exc.IntegrityError" exception to deal with not crashing when trying to insert data which is already in the database?
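That is one way, yes; both options are sketched below with plain sqlite3 so the example runs standalone (table and columns are invented). With SQLAlchemy the analogous pattern is catching `sqlalchemy.exc.IntegrityError` and calling `session.rollback()` before continuing, since a failed flush leaves the session unusable until it is rolled back:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE labs (mrn TEXT, lab_date TEXT, value REAL, '
            'PRIMARY KEY (mrn, lab_date))')

row = ('123', '2020-03-21', 7.1)
con.execute('INSERT INTO labs VALUES (?, ?, ?)', row)

# Option 1: catch the duplicate-key error when re-inserting the same data.
try:
    con.execute('INSERT INTO labs VALUES (?, ?, ?)', row)
except sqlite3.IntegrityError:
    pass  # already present; skip it

# Option 2: let the database skip duplicates itself.
con.execute('INSERT OR IGNORE INTO labs VALUES (?, ?, ?)', row)

print(con.execute('SELECT COUNT(*) FROM labs').fetchone()[0])  # 1
```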
|
# ¿ Mar 23, 2021 20:10 |