Hollow Talk
Feb 2, 2014

Bundy posted:

Use sqlite or something god drat pandas has become the new excel

With roughly the same quality of code produced. :v:


NinpoEspiritoSanto
Oct 22, 2013




Hollow Talk posted:

With roughly the same quality of code produced. :v:

Can confirm lol

Dominoes
Sep 20, 2007

I don't know how feasible this is for your use case, but I agree - Pandas can be a performance bottleneck. You're working with strings instead of numbers, but when working with numbers, using numpy arrays (which IIRC it wraps) is OOMs faster.

Data Graham
Dec 28, 2009

📈📊🍪😋



mods change username to steak overflow

CarForumPoster
Jun 26, 2013

⚡POWER⚡
Yea a db is the way to go, though I also need to do fuzzy string matching, which is a whole other issue, but I suppose if I wanna do that I can query all 8M rows of 1 or 2 cols into memory then do my fuzzy matching in python slow as loving can be. Maybe make an AWS Lambda function to parallelize a bunch of these but have the

Hey so new dumb question that has gently caress all to do with python except that I'll interface with it using python...can I have a list of foreign keys in a relational DB column (or rather, in one element in the column)?

For example, say I have a table of businesses with each row a new business' attributes, and a table of addresses. Can I have a column in the businesses table that has multiple references to one address? Or would I not make it a foreign key and just make it a list, because that's the point of lists.

KICK BAMA KICK
Mar 2, 2009

CarForumPoster posted:

Yea a db is the way to go though I also need to do fuzzy string matching which is a whole other issue but I suppose if I wanna do that I can query all 8M rows of 1 or 2 cols into memory the do my fuzzy matching in python slow as loving can be. Maybe make a AWS Lambda function to parallelize a bunch of these but have the
I was messing around with some fuzzy string search a while ago and tried fuzzywuzzy but got faster and better results for my case just using the Postgres full-text search capability (via Django). The only thing "fuzzy" about my case was that both the query and the stored texts were OCRed and often disagreed by a few characters even from the exact same canonical text, so idk if that's useful.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

KICK BAMA KICK posted:

I was messing around with some fuzzy string search a while ago and tried fuzzwuzz but got faster and better results for my case just using the Postgres full-text search capability (via Django). The only thing "fuzzy" about my case was that both the query and the stored texts were OCRed and often disagreed by a few characters even from the exact same canonical text so idk if that's useful.

Looking into this now but fuzzywuzzy was the "it takes too fuckin long" plan. In this case I need to find addresses that fuzzywuzzy would generally return as a 0.85 or better match.

For example:

I might query something like the USPS standardized address line 1:

123 GOONDOLANCES LN

And these might be returned by fuzzywuzzy with 0.85 or better:

123 GONDOLANCES LN
123 GOONDOLANCES LANE
453 GOONDOLANCES LANE
125 GOON DOLANCES LN

There's more to match than addresses (I know I could do something like a regex for the street no that filters to a much shorter list before fuzzy matching if I really needed to do that) but they are all strings.
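(For what it's worth, the "regex for the street no, then fuzzy match the shortlist" idea might look roughly like this. A hedged sketch: `addresses` is a placeholder for a list of address strings pulled into memory, and note that fuzzywuzzy scores run 0-100 rather than 0-1.)

Python code:
import re
from fuzzywuzzy import fuzz

query = "123 GOONDOLANCES LN"

# hypothetical: addresses is a list of USPS-standardized address strings
street_no = re.match(r"\d+", query).group()
shortlist = [a for a in addresses if a.startswith(street_no)]  # cheap prefilter

# fuzzywuzzy scores are 0-100, so "0.85 or better" is roughly a cutoff of 85
matches = [a for a in shortlist if fuzz.ratio(query, a) >= 85]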

QuarkJets
Sep 8, 2008

CarForumPoster posted:

Yea a db is the way to go though I also need to do fuzzy string matching which is a whole other issue but I suppose if I wanna do that I can query all 8M rows of 1 or 2 cols into memory the do my fuzzy matching in python slow as loving can be. Maybe make a AWS Lambda function to parallelize a bunch of these but have the

Hey so new dumb question that has gently caress all to do with python except that I'll interface with it using python...can I have a list of foreign keys in a relational DB column (or rather, in one element in the column)?

For example say I have a table of businesses with each row a new business' attributes and a table of addresses. Can I have a column in the businesses table that has multiple references to one address? Or would I not make it a foreign key and just make it a list because thats the point of lists.

You want to store multiple foreign keys in a single column?

You can implement this by defining a new table that just stores those foreign keys with a common id, then your original table just stores that 1 id.
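(In SQL terms that helper table is usually called an association or junction table. A minimal SQLAlchemy sketch of the idea, with invented table and column names since the actual schema isn't shown:)

Python code:
from sqlalchemy import Table, Column, Integer, String, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

# junction table: one row per (business, address) pair
business_addresses = Table(
    'business_addresses', Base.metadata,
    Column('business_id', Integer, ForeignKey('businesses.id'), primary_key=True),
    Column('address_id', Integer, ForeignKey('addresses.id'), primary_key=True),
)

class Business(Base):
    __tablename__ = 'businesses'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    # a business can reference many addresses without storing a list in a column
    addresses = relationship('Address', secondary=business_addresses)

class Address(Base):
    __tablename__ = 'addresses'
    id = Column(Integer, primary_key=True)
    line1 = Column(String)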

Foxfire_
Nov 8, 2010

CarForumPoster posted:

Yea a db is the way to go though I also need to do fuzzy string matching which is a whole other issue but I suppose if I wanna do that I can query all 8M rows of 1 or 2 cols into memory the do my fuzzy matching in python slow as loving can be. Maybe make a AWS Lambda function to parallelize a bunch of these but have the

Your underlying problem is more that pandas in general is trying to optimize for person-time writing code at the expense of being very slow and using lots of RAM. Generally that's a good trade-off, but not here, where you have too much stuff and are going to run it a lot. pandas is also horrible at strings, since the underlying things actually being stored are arrays of PyObject pointers to full Python objects elsewhere.

8,000,000 rows x 50 cols isn't that much data. Like if each string is 64 bytes, that's still only about 25GB, which is only going to take a second or two to go through if they're already in RAM.

I would:
- Move data to plain numpy arrays of string_ dtype (fixed length, not python objects)
- Do the fuzzy string match in numba. Levenshtein distance implementations are easily googleable (see the sketch below); ought to take less than a second to compute the distance between the query and every element in a column

Wouldn't use a database since you won't be able to do the fuzzy matching on the database side, and you want to avoid having to construct python objects or run python code per element.
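(A rough, untested sketch of the numpy + numba approach. Since numba's support for bytes dtypes is limited, this views the fixed-width strings as a 2-D uint8 array; `addresses` and the 64-byte width are placeholders.)

Python code:
import numpy as np
from numba import njit

@njit(cache=True)
def levenshtein(a, b):
    # a, b: 1-D uint8 arrays of fixed width, zero-padded on the right
    la, lb = len(a), len(b)
    while la > 0 and a[la - 1] == 0:  # strip padding
        la -= 1
    while lb > 0 and b[lb - 1] == 0:
        lb -= 1
    prev = np.arange(lb + 1)
    curr = np.empty_like(prev)
    for i in range(1, la + 1):
        curr[0] = i
        for j in range(1, lb + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1, min(curr[j - 1] + 1, prev[j - 1] + cost))
        prev, curr = curr, prev
    return prev[lb]

@njit(cache=True)
def distances(query, column):
    # distance from query to every row of a 2-D uint8 array
    out = np.empty(column.shape[0], dtype=np.int64)
    for i in range(column.shape[0]):
        out[i] = levenshtein(query, column[i])
    return out

# hypothetical setup: addresses is a plain Python list of strings
arr = np.array(addresses, dtype='S64')            # fixed-width bytes, not PyObjects
codes = arr.view(np.uint8).reshape(len(arr), -1)  # numba-friendly view
query = np.frombuffer(b'123 GOONDOLANCES LN'.ljust(64, b'\0'), dtype=np.uint8)
dist = distances(query, codes)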

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Foxfire_ posted:

Your underlying problem is more that pandas in general is trying to optimize for person-time writing code at the expense of being very slow and using lots of RAM. Generally that's a good assumption, but not here if you have too much stuff and are going to run it a lot. pandas is also horrible at strings since the underlying actual things being stored is arrays of PyObject pointers to full python objects elsewhere.

8,000,000 rows x 50 cols isn't that much data. Like if each string is 64bytes, that's still only about 25GB, which is only going to take a second or two to go through if they're already in RAM.

I would:
- Move data to plain numpy arrays of string_ dtype (fixed length, not python objects)
- Do the fuzzy string match in numba. Levenshtein distance implementations are easily googleable
Ought to take less than a second to compute distance between the query and every element in a column

Wouldn't use a database since you won't be able to do the fuzzy matching on the database side, and you want to avoid having to construct python objects or run python code per element.

Yea that sounds like a good plan and I def hadn't thought of that. I'll do some googling this week. This data only gets updated like once per year so I could easily save a pickle of it.

These are instances where I'll need to see if one string from these 8M rows exists in another set of 7M rows (with 4 columns it could be in), but if the numpy/numba solution works then we're all good. Doesn't have to be blazing fast, just not 5-10 minutes per answer.

KICK BAMA KICK posted:

I was messing around with some fuzzy string search a while ago and tried fuzzwuzz but got faster and better results for my case just using the Postgres full-text search capability (via Django). The only thing "fuzzy" about my case was that both the query and the stored texts were OCRed and often disagreed by a few characters even from the exact same canonical text so idk if that's useful.

Django even has trigram similarity & distance. I might give this a shot because if it works it basically solves my problem. Bonus that I'm likely to use Django for deploying parts of this anyway.

CarForumPoster fucked around with this message at 03:26 on Mar 14, 2021

Dominoes
Sep 20, 2007

Foxfire_ posted:

Your underlying problem is more that pandas in general is trying to optimize for person-time writing code at the expense of being very slow and using lots of RAM. Generally that's a good assumption, but not here if you have too much stuff and are going to run it a lot. pandas is also horrible at strings since the underlying actual things being stored is arrays of PyObject pointers to full python objects elsewhere.

8,000,000 rows x 50 cols isn't that much data. Like if each string is 64bytes, that's still only about 25GB, which is only going to take a second or two to go through if they're already in RAM.

I would:
- Move data to plain numpy arrays of string_ dtype (fixed length, not python objects)
- Do the fuzzy string match in numba. Levenshtein distance implementations are easily googleable
Ought to take less than a second to compute distance between the query and every element in a column

Wouldn't use a database since you won't be able to do the fuzzy matching on the database side, and you want to avoid having to construct python objects or run python code per element.
Agree. Now that I'm in a mood to ignore diplomacy: Pandas blows. The devs think we live in a post-computational-scarcity world. They're wrong.

QuarkJets
Sep 8, 2008

lol yeah I really want to use pandas in a project some day but its special niche capabilities just never seem to be worth the performance penalty for me

punished milkman
Dec 5, 2018

would have won

CarForumPoster posted:

Django even has trigram similarity & distance. I might give this a shot because it it works it basically solves my problem. Bonus that I'm likely to use Django for deploying parts of this anyway.

I’ve done the trigram search stuff with Django in the past and it’s pretty neat. Note that it’s only supported if you’re using PostgreSQL, and you need to manually make a migration that specifies you’re using the trigram extension. Once you’ve got it set up it’s dead simple to use with the ORM.
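(A hedged sketch of what that looks like in practice; the app label, model, and field names are invented for illustration.)

Python code:
# migration: enable the pg_trgm extension (PostgreSQL only)
from django.contrib.postgres.operations import TrigramExtension
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [('businesses', '0001_initial')]  # hypothetical app / prior migration
    operations = [TrigramExtension()]


# then the ORM query is a one-liner
from django.contrib.postgres.search import TrigramSimilarity
from businesses.models import Business  # hypothetical model

matches = (
    Business.objects
    .annotate(sim=TrigramSimilarity('address_line1', '123 GOONDOLANCES LN'))
    .filter(sim__gt=0.85)   # whatever cutoff works for your data
    .order_by('-sim')
)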

Zoracle Zed
Jul 10, 2001

Bundy posted:

Use sqlite or something god drat pandas has become the new excel

I am somehow angry at how true this is

CarForumPoster
Jun 26, 2013

⚡POWER⚡

punished milkman posted:

I’ve done the trigram search stuff with Django in the past and it’s pretty neat. Note that it’s only supported if you’re using PostgreSQL, and you need to manually make a migration that specifies you’re using the trigram extension. Once you’ve got it set up it’s dead simple to use with the ORM.

Good to know. I'd be sticking it on RDS so Postgres should be fine.

NinpoEspiritoSanto
Oct 22, 2013




punished milkman posted:

I’ve done the trigram search stuff with Django in the past and it’s pretty neat. Note that it’s only supported if you’re using PostgreSQL, and you need to manually make a migration that specifies you’re using the trigram extension. Once you’ve got it set up it’s dead simple to use with the ORM.

Would like to thank those in this conversation because holy poo poo, I didn't know this was a thing. This owns and has helped me make a "which framework" decision.

Postgres is always a no-brainer, I wish mysql and its ilk would loving die

Signed, paid mysql dba among other things

Da Mott Man
Aug 3, 2012


salisbury shake posted:

Turns out they're only called upon SIGINT or during an otherwise clean shutdown.

Don't know if you figured out a solution for this, but https://docs.python.org/3.8/library/signal.html#signal.signal is what I've used in the past to trigger cleanups.
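(For reference, the pattern being pointed to is roughly this.)

Python code:
import signal
import sys

def cleanup(signum, frame):
    # close files, flush buffers, release locks, etc.
    print(f"caught signal {signum}, cleaning up")
    sys.exit(0)

# SIGINT is Ctrl-C; SIGTERM is what most process managers send on shutdown
signal.signal(signal.SIGINT, cleanup)
signal.signal(signal.SIGTERM, cleanup)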

CarForumPoster
Jun 26, 2013

⚡POWER⚡
Pandas: I learned that regex=False for .str.contains is much faster. Even filtering on two values. For example:
code:

# regex=False with two substring filters: ~3.03 s
df[(df["BorrowerName"].str.contains("COMPANY", regex=False)) & (df["BorrowerName"].str.contains("INC", regex=False))]

# regex=True with a single pattern: ~5.19 s
df[(df["BorrowerName"].str.contains("^COMPANY(.*)INC", regex=True))]

Macichne Leainig
Jul 26, 2012

by VG
You know, I do a lot of stuff with Pandas, but after reading some of the recent posts in this thread... :ohdear:

a foolish pianist
May 6, 2007

(bi)cyclic mutation

Protocol7 posted:

You know, I do a lot of stuff with Pandas, but after reading some of the recent posts in this thread... :ohdear:

Pandas is cool and good, and don't let this thread put you off it.

Macichne Leainig
Jul 26, 2012

by VG

a foolish pianist posted:

Pandas is cool and good, and don't let this thread put you off it.

It is, but I'm just using it to iterate over rows and sometimes filter rows based on whether a column equals a specific value. It's probably a bit overkill for what I need and I have been burned by it specifically before.

For example, did you know if you are copying items from one spreadsheet to another, that you need to specify pandas.set_option('display.max_colwidth', SOME_LARGE_VALUE) or else it will just silently truncate strings longer than 100 characters by default? Because I sure didn't.

Foxfire_
Nov 8, 2010

Protocol7 posted:

You know, I do a lot of stuff with Pandas, but after reading some of the recent posts in this thread... :ohdear:

It is fine and good if the problem you are trying to solve matches up with what it is designed to do. It is a good tool for interactively viewing/manipulating small amounts of data. It is not a good tool for manipulating multi-GB datasets or doing things you're going to run a million times where you care about performance. If you're only going to run something once and it still only takes a few seconds, doing it 100X slower is a good trade for saving 5 mins of your time.


Protocol7 posted:

For example, did you know if you are copying items from one spreadsheet to another, that you need to specify pandas.set_option('display.max_colwidth', SOME_LARGE_VALUE) or else it will just silently truncate strings longer than 100 characters by default? Because I sure didn't.
Are you copy/pasting its display output somewhere? The REPL output is intended for humans, not tools, and humans usually don't want to see giant strings in their tables, so that's not unexpected. If you're trying to move data somewhere, tell it to write csv or excel or something (unless you are doing that and it still truncates, in which case, eww)

QuarkJets
Sep 8, 2008

a foolish pianist posted:

Pandas is cool and good, and don't let this thread put you off it.

It's cool and good but not very performant, that's all. A lot of usecases don't need any extra performance

Macichne Leainig
Jul 26, 2012

by VG

Foxfire_ posted:

(unless you are doing that and it still truncates, in which case, eww)

This is the case unfortunately. I've got a dozen or so functions I use to process spreadsheets regularly where the use case is something like:

Python code:
def copy_info_from_dataframe():
    input_df = pd.read_csv("data.csv")
    copy_from_df = pd.read_csv("other.csv")

    output = []

    # for each input row, look for exactly one matching donor row by ID
    for _, row in input_df.iterrows():
        matching_donor_row = copy_from_df.query(f"ID == '{row['ID']}'")
        if len(matching_donor_row) == 1:
            # pull the scalar value out of the single matching row
            row["column"] = matching_donor_row["column"].iloc[0]

            output.append(row.copy())

        print("Processed", row["ID"])

    output_df = pd.DataFrame(data=output)
    output_df.to_csv("output.csv", index=False)
It works, but I feel like using Pandas for this is like trying to smack a fly with a sledgehammer in a sense. And it's definitely not fast when running on 100k+ row long CSV files.

lazerwolf
Dec 22, 2009

Orange and Black

Protocol7 posted:

This is the case unfortunately. I've got a dozen or so functions I use to process spreadsheets regularly where the use case is something like:

Python code:
def copy_info_from_dataframe():
    input_df = pd.read_csv("data.csv")
    copy_from_df = pd.read_csv("other.csv")

    output = []

    for _, row in input_df.iterrows():
        matching_donor_row = copy_from_df.query(f"ID == '{row['ID']}'")
        if len(matching_donor_row) == 1:
            row["column"] = matching_donor_row["column"]

            output.append(row.copy())

        print("Processed", row["ID"])


    output_df = pd.DataFrame(data=output)
    output_df.to_csv("output.csv", index=False)
It works, but I feel like using Pandas for this is like trying to smack a fly with a sledgehammer in a sense. And it's definitely not fast when running on 100k+ row long CSV files.

Have you tried using pandas.merge for something like this?

Python code:
def copy_info_from_dataframe():
    input_df = pd.read_csv("data.csv")
    copy_from_df = pd.read_csv("other.csv")
    # join on the column(s) you match on, e.g. ID
    merged_df = pd.merge(input_df, copy_from_df, how="inner", on=["ID"])
    # select whichever columns you want in the output here
    merged_df = merged_df[["ID", "column"]]

    merged_df.to_csv("output.csv", index=False)
How you want your output to look should dictate which how method you use. An inner join here will produce an output file with only the IDs found in both files, which should match the condition in your loop above.

I like pandas and use it a fair bit at work for processing table based outputs for reformatting and adding some calculations before generating an output file. It can be cleaner than using a bunch of loops through each line of a file. However, R's tidyverse package is far superior to pandas in terms of syntax.

Python code:
df = df[['Column1', 'Column2']]
df = df[df['Column1'] > 12]
df.to_csv(etc)
R code:
df %>% select(Column1, Column2) %>%
   filter(Column1 > 12) %>%
   write(etc)

lazerwolf fucked around with this message at 21:51 on Mar 15, 2021

Biffmotron
Jan 12, 2007

Yeah, df.query() is especially slow, taking about a ms per use. It makes for clearer code to read, but if you have to do it 100k times, it'll be slow.

pd.merge() with an inner join is the right way to go. My one note would be that for the example code, you also need to get rid of any row in copy_from_df that has an ID that appears more than once, so do this before the merge.

Python code:
vc = copy_from_df['ID'].value_counts()  # get value_counts once and save
copy_from_df['vc'] = copy_from_df['ID'].map(vc)  # per-row count of that ID
copy_from_df = copy_from_df.query('vc == 1')  # keep IDs that appear exactly once
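(If the helper column feels clunky, this should be an equivalent filter, untested, that keeps only IDs appearing exactly once.)

Python code:
copy_from_df = copy_from_df[~copy_from_df['ID'].duplicated(keep=False)]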

salisbury shake
Dec 27, 2011

Da Mott Man posted:

Don't know if you figured out a solution for this, but https://docs.python.org/3.8/library/signal.html#signal.signal is what I've used in the past to trigger cleanups.

I was originally using signal handlers, but I was hoping weakref.finalize or atexit would yield a more elegant solution. Ended up just sticking with the signal handlers.

Macichne Leainig
Jul 26, 2012

by VG
Cool, thanks for the pandas merge info friendly goons. My spreadsheets come out of a MySQL database that has proper uniqueness enforced so I think that should work perfectly.

Hadlock
Nov 9, 2004

We have an extensive, 5+ year old django monolith and I'd like to hammer 4-5 critical internal API endpoints with fuzzed data and watch where it breaks, i.e. load testing.

Is there any django-specific tooling for this, or is it best to just go look at the data contracts and write something that can run async against each fairly standard django api endpoint?

Jose Cuervo
Aug 25, 2004
I want to build what I think is a pretty simple database using SQLite and SQLAlchemy.

The database is going to hold a subset of data from a different database where each patient already has a unique numeric identifier (Medical Record Number, MRN).

The database is intended to hold information relevant to someone undergoing dialysis. In addition to the usual demographic information the patient will also have
1. A diabetic status (which can change over time from No to Yes)
2. Lab values (these patients get labs run at least once a month)
3. Hospitalizations

Below is what I have started to code up. In particular, I want to know if
1. I have defined the primary key in the Patient class correctly (I want to use the MRN as the primary key because data being added to the database can be uniquely identified as belonging to a patient with that MRN),
2. I have defined the relationships correctly,
3. I have defined the Foreign Keys correctly, and
4. What I need to add so that when a particular patient is removed from the 'patients' table (see below), all their associated data from the diabetic and hospitalizations tables is also removed.

Python code:
from sqlalchemy import create_engine, Column, Integer, Float, \
                        String, Date, Boolean, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship


Base = declarative_base()

class Patient(Base):
    __tablename__ = 'patients'

    id = Column(Integer)
    MRN = Column(Integer, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)
    date_of_birth = Column(String) 

    diabetic = relationship("Diabetic", back_populates='patient')
    lab_values = relationship('LabValues', back_populates='patient')
    hospitalizations = relationship('Hospitalization', back_populates='patient')

    def __repr__(self):
        return f'<Patient({self.last_name}, {self.first_name}' \
               f' ({self.MRN}, {self.date_of_birth})>'
    
    
class Diabetic(Base):
    __tablename__ = 'diabetic'
    
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    diabetic = Column(Boolean)
    MRN = Column(Integer, ForeignKey('patients.MRN'))
    
    patient = relationship('Patient', back_populates='diabetic')   


class LabValues(Base):
    __tablename__ = 'lab_values'
    
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    name = Column(String)
    units = Column(String)
    value = Column(Float)
    MRN = Column(Integer, ForeignKey('patients.MRN'))
    
    patient = relationship('Patient', back_populates='lab_values')
    
    def __repr__(self):
        return f'<LabValue ({self.date}, {self.name}, {self.value} {self.units})>'
		
	
class Hospitalization(Base):
    __tablename__ = 'hospitalizations'
    
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    hospital_name = Column(String)
    admit_reason = Column(String)
    length_of_stay = Column(Integer)
    MRN = Column(Integer, ForeignKey('patients.MRN'))
    
    patient = relationship('Patient', back_populates='hospitalizations') 
EDIT: And if it was not clear this is my first time working with databases, so what I have here might be completely wrong.

Da Mott Man
Aug 3, 2012



It's been a while since I worked with sqlalchemy, but if I remember correctly your id columns should have autoincrement=True, and for speed of lookups you should have index=True for any column you would use to select records.
You might also want to cascade delete child data if the patient is deleted for some reason.
I would also set the date_of_birth column to a DateTime instead of a string; it would be easier to query for patients of a specific age or range of ages.
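(Pulling those suggestions together, a hedged sketch of the Patient model and one child table. The cascade='all, delete-orphan' setting handles deletes done through the ORM session, while ondelete='CASCADE' covers deletes done in raw SQL; note SQLite only enforces the latter with foreign key support turned on.)

Python code:
class Patient(Base):
    __tablename__ = 'patients'

    MRN = Column(Integer, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)
    date_of_birth = Column(Date)  # a real date type instead of String

    # deleting a patient also deletes their child rows via the ORM
    diabetic = relationship('Diabetic', back_populates='patient',
                            cascade='all, delete-orphan')
    lab_values = relationship('LabValues', back_populates='patient',
                              cascade='all, delete-orphan')
    hospitalizations = relationship('Hospitalization', back_populates='patient',
                                    cascade='all, delete-orphan')


class LabValues(Base):
    __tablename__ = 'lab_values'

    id = Column(Integer, primary_key=True, autoincrement=True)
    date = Column(Date, index=True)   # indexed for faster lookups
    name = Column(String)
    units = Column(String)
    value = Column(Float)
    MRN = Column(Integer, ForeignKey('patients.MRN', ondelete='CASCADE'), index=True)

    patient = relationship('Patient', back_populates='lab_values')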

Da Mott Man fucked around with this message at 07:13 on Mar 23, 2021

Hollow Talk
Feb 2, 2014

Jose Cuervo posted:

:words:

Python code:
from sqlalchemy import create_engine, Column, Integer, Float, \
                        String, Date, Boolean, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship


Base = declarative_base()

class Patient(Base):
    __tablename__ = 'patients'

    id = Column(Integer)
    MRN = Column(Integer, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)
    date_of_birth = Column(String) 

    diabetic = relationship("Diabetic", back_populates='patient')
    lab_values = relationship('LabValues', back_populates='patient')
    hospitalizations = relationship('Hospitalization', back_populates='patient')

    def __repr__(self):
        return f'<Patient({self.last_name}, {self.first_name}' \
               f' ({self.MRN}, {self.date_of_birth})>'
    
    
class Diabetic(Base):
    __tablename__ = 'diabetic'
    
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    diabetic = Column(Boolean)
    MRN = Column(Integer, ForeignKey('patients.MRN'))
    
    patient = relationship('Patient', back_populates='diabetic')   


class LabValues(Base):
    __tablename__ = 'lab_values'
    
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    name = Column(String)
    units = Column(String)
    value = Column(Float)
    MRN = Column(Integer, ForeignKey('patients.MRN'))
    
    patient = relationship('Patient', back_populates='lab_values')
    
    def __repr__(self):
        return f'<LabValue ({self.date}, {self.name}, {self.value} {self.units})>'
		
	
class Hospitalization(Base):
    __tablename__ = 'hospitalizations'
    
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    hospital_name = Column(String)
    admit_reason = Column(String)
    length_of_stay = Column(Integer)
    MRN = Column(Integer, ForeignKey('patients.MRN'))
    
    patient = relationship('Patient', back_populates='hospitalizations') 

I'm not a fan of pro-forma integer id columns. I see these columns often, and they are useless more often than not (auto-incrementing columns are only really useful if you want to check for holes in your sequence, i.e. for deleted data, and that can be solved differently as well).

Patient already has a primary key, so the id column is superfluous. For the other tables, keys depend on how often you expect to load new data. If this only gets new data once a day, just make the combination of MRN + date the table's primary key (i.e. a compound key) -- SQLAlchemy lets you set multiple columns as primary keys, and it just makes a combined key out of it. If you update more frequently than daily, I would change date to a timestamp, and use that. Realistically, depending on your analysis needs, you will probably end up joining and either selecting the full history for a single patient, or the current status for all patients (or a subset thereof), which this would cover either way. For LabValues, you might need MRN + date + name (assuming name is whatever was tested in the lab).

Essentially, I am of the opinion that primary keys should have meaning on their own, which sequences (i.e. what SQLAlchemy will most likely create under the hood if you have an auto-incrementing integer column and if the underlying database supports sequences) most often do not.

Edit: This is on top of what has already been said. Also, the foreign keys look good to me.
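(Concretely, the compound-key version of two of the child tables might look like this, assuming at most one row per patient per day, or per patient per day per test name for labs.)

Python code:
class Diabetic(Base):
    __tablename__ = 'diabetic'

    # compound primary key: at most one status row per patient per day
    MRN = Column(Integer, ForeignKey('patients.MRN'), primary_key=True)
    date = Column(Date, primary_key=True)
    diabetic = Column(Boolean)

    patient = relationship('Patient', back_populates='diabetic')


class LabValues(Base):
    __tablename__ = 'lab_values'

    # compound primary key: one row per patient, per day, per test name
    MRN = Column(Integer, ForeignKey('patients.MRN'), primary_key=True)
    date = Column(Date, primary_key=True)
    name = Column(String, primary_key=True)
    units = Column(String)
    value = Column(Float)

    patient = relationship('Patient', back_populates='lab_values')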

Hollow Talk fucked around with this message at 13:17 on Mar 23, 2021

M. Night Skymall
Mar 22, 2012

You should probably not use the MRN as the identifier throughout your database. It's generally best practice with PHI to isolate out the MRN into a de-identification table and use an identifier unique to your application to identify the patient. It's easy enough to do a join or whatever if someone wants to look people up by MRN, but it's useful to be able to display some kind of unique identifier that isn't immediately under all the HIPAA restrictions as PHI. Even if it's as simple as someone trying to do a bug report and not having to deal with the fact that their bug report must contain PHI to tell you which patient caused it. Not vomiting out PHI 100% of the time in error messages, things like that. Just make some other ID in the patient table and use that as the foreign key elsewhere.
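(One way to structure that split, as a hedged sketch with invented names: the MRN lives only in a de-identification table, and everything else references an application-internal id.)

Python code:
class Patient(Base):
    __tablename__ = 'patients'

    # application-internal surrogate key; the rest of the schema references this
    id = Column(Integer, primary_key=True, autoincrement=True)
    # ... other demographic columns as before


class PatientIdentity(Base):
    """De-identification table: the only place the MRN (PHI) is stored."""
    __tablename__ = 'patient_identities'

    patient_id = Column(Integer, ForeignKey('patients.id'), primary_key=True)
    MRN = Column(Integer, unique=True, index=True)


class LabValues(Base):
    __tablename__ = 'lab_values'

    id = Column(Integer, primary_key=True)
    date = Column(Date)
    name = Column(String)
    units = Column(String)
    value = Column(Float)
    patient_id = Column(Integer, ForeignKey('patients.id'), index=True)  # not the MRN

    patient = relationship('Patient')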

xtal
Jan 9, 2011

by Fluffdaddy
This isn't actual healthcare code being written in Python on something awful right?

M. Night Skymall
Mar 22, 2012

xtal posted:

This isn't actual healthcare code being written in Python on something awful right?

I run a backend for a healthcare reporting tool in Python, would you uh..prefer MUMPS? You really think that's a good language?

a foolish pianist
May 6, 2007

(bi)cyclic mutation

xtal posted:

This isn't actual healthcare code being written in Python on something awful right?

Is there some reason healthcare stuff shouldn't be written in Python?

xtal
Jan 9, 2011

by Fluffdaddy
I guess I had hoped it was something with correctness guarantees, but the really troubling thing here is asking goons for help with healthcare code

Edit: upon further thought, Python would suck for healthcare too, you'd probably still be using 2.x and Therac-25 someone whenever a time object tested falsy at midnight

xtal fucked around with this message at 15:45 on Mar 23, 2021

M. Night Skymall
Mar 22, 2012

xtal posted:

I guess I had hoped it was something with correctness guarantees, but the really troubling thing here is asking goons for help with healthcare code

Edit: upon further thought, Python would suck for healthcare too, you'd probably still be using 2.x and Therac-25 someone whenever a time objected tested falsy at midnight

I guess work in healthcare first and then lemme know if you still want to complain about Python of all things, definitely the least of healthcare software's problems. I mean, I doubt anyone's using it to administer drugs with a raspberry Pi, but "I need to aggregate a bunch of data in disparate formats and reason about it" is a very healthcare and python thing to do.

necrotic
Aug 2, 2005
I owe my brother big time for this!

xtal posted:

I guess I had hoped it was something with correctness guarantees

Like what. MUMPS?


xtal
Jan 9, 2011

by Fluffdaddy
I honestly don't know that much about MUMPS but if so maybe that's why they keep using it!
