Foxfire_
Nov 8, 2010

KingNastidon posted:

Just getting started with pandas, so this is probably a really naive question. I'm trying to replicate a really basic algorithm from VBA for oncology forecasting. There is a data file with new patient starts and anticipated duration of therapy in monthly cohorts. Each new patient cohort loses a certain number of patients each time period using an exponential function to mimic the shape of a Kaplan-Meier progression curve. The sum of a column represents the total remaining patients on therapy across all cohorts. Below is the code I'm using:

code:
for x in range(0, nummonths):
    for y in range(x, nummonths):
        cohorts.iloc[x,y] = data.ix[x,'nps'] * math.pow(data.ix[x,'duration'], y-x)
Calculating this n by n array with n>500 is instantaneous in VBA, but is taking 10+ seconds in pandas with jupyter. Any thoughts on what makes this so slow in pandas or better way to go about this?

Python is extremely slow and extremely RAM hungry. This is a consequence of aspects of its type system and design. The interpreter has no way to know what to do for any operation without going through a dance of type lookups. It also can't prove that those operations are the same from loop to loop, so it has to redo all that work for every element.

The trick to get acceptable performance is to arrange it so that none of the actual computation is done in python (so it is fast) and none of the data is saved as python objects (so it is RAM efficient). You essentially have a bunch of python glue code that is telling external libraries what to do in broad strokes, with the actual computation done by not-python.

The main library you use to do this is numpy. Pandas uses numpy arrays as the storage for its DataFrame and Series values.

A line like:

code:
foo = numpy.zeros(100, dtype=numpy.int32)
is saying "Numpy, allocate an array of 100 int32's and return a python object that represents that array". The actual array is allocated from the C runtime's heap and isn't made up of python objects.

Something like:

code:
foo *= 2
is saying "Numpy, go find the array represented by foo and multiply all its values by 2". No python code runs to do the multiplications.

A piece of code like this:
code:
for i in range(len(foo)):
  foo[i] *= 2
is saying, "In python, have i iterate from 0 to the length of foo. For each index, ask numpy to construct a python object who has the same value as the array at that index. Lookup its type, figure out what *=2 means for it (maybe it's integer math, maybe it's float math, maybe it's string concatenation, etc...), do that operation on the python object, then ask numpy to change the value of the array at that index to have the same value as the python object."

Basically any time you loop over your data, the performance will be crap.

Your options are:
1. Rearrange your code so that it's asking numpy to do bulk operations instead of doing them in python. This can be hard or impossible to do and isn't usually how people think about their problems (see the sketch just below this list for what it looks like here).
2. Write the algorithm that's hard to express as bulk operations in C, so it becomes a new bulk operation you can call. Alternately, dig through libraries to see if someone else has already done this for the algorithm you want.
3. Use numba. numba is a python-like sublanguage where enough python features are removed and types are locked down that it can be efficiently compiled. Depending on your algorithm, rewriting it in numba could require zero work or could be more annoying than redoing it in C. (The example you posted would be on the zero-work end.)
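For option 1, here's a rough sketch of what a bulk-operation version of your cohort loop could look like (the nps/duration arrays are made-up stand-ins for your data columns; check the edge cases against your VBA version, and wrap the result in a DataFrame at the end if you need one):

Python code:
import numpy as np

nummonths = 500
nps = np.random.rand(nummonths)        # stand-in for data['nps']
duration = np.random.rand(nummonths)   # stand-in for data['duration']

# exponent matrix: entry [x, y] is y - x
exponents = np.arange(nummonths)[None, :] - np.arange(nummonths)[:, None]

# cohorts[x, y] = nps[x] * duration[x] ** (y - x), computed for every cell at once
cohorts = nps[:, None] * duration[:, None] ** np.maximum(exponents, 0)

# zero out the lower triangle (y < x), which the original loop never touched
cohorts = np.triu(cohorts)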


Foxfire_
Nov 8, 2010

Malcolm XML posted:

This has little to do with python being compiled or not: pandas has a ton of overhead if you don't do things the correct way

Huh, I was expecting the pandas overhead to be much smaller. Some testing shows that it goes:

vectorized numpy > vectorized pandas >> iterative numpy >>> iterative pandas

An iterative solution in something with known types (C/Fortran/Java/numba/Julia/whatever) will still be faster for complicated calculations than the vectorized versions since the vectorized ones destroy all cache locality once the dataset is big enough (by the time you go back to do operation 2 on item 1, it's been booted out of the cache), but you can still get enough speedup to move some algorithms from unworkable to useable.

Python's speed problems mostly don't have to do with it being compiled in advance or not. They're consequences of how it works with types and the stuff it lets you change. It's basically impossible to JIT compile stuff like you would in Java because you can't prove that a loop isn't doing things like monkey patching functions or operators at some random loop iteration in the middle. The interpreter has to execute what you put down naively since it can't prove that it's safe to take any shortcuts.

Timing results:
code:
import numpy as np
import pandas as pd

numpy_array1 = np.random.rand(100000)
numpy_array2 = np.random.rand(100000)

print("Vectorized numpy")
%timeit out = numpy_array1 + numpy_array2

print("Iterative numpy")
out = np.empty(100000, dtype=np.float64)
%timeit for i in np.arange(100000): out[i] = numpy_array1[i] + numpy_array2[i]

pandas_dataframe = pd.DataFrame({'A': numpy_array1, 'B': numpy_array2})
print("Vectorized pandas")
%timeit out = pandas_dataframe.A + pandas_dataframe.B

print("Iterative pandas")
out = np.empty(100000, dtype=np.float64)
%timeit for i in np.arange(100000): out[i] = pandas_dataframe.A.iloc[i] + pandas_dataframe.B.iloc[i]
code:
Vectorized numpy
10000 loops, best of 3: 150 µs per loop
Iterative numpy
10 loops, best of 3: 52.1 ms per loop
Vectorized pandas
10000 loops, best of 3: 181 µs per loop
Iterative pandas
1 loop, best of 3: 4.3 s per loop

Foxfire_
Nov 8, 2010

Linear Zoetrope posted:

Do you need to touch the reference counters or GIL if you want to spawn a bunch of threads from the C side, call Python functions that return PyObjects, and forward the PyObjects between the threads? To be clear, only one thread will touch a given PyObject * at a time.

You can pass the pointers themselves amongst your threads however you want. But at most one thread may be invoking Python code at any time, and theoretically it should own the GIL. You can't be accessing multiple PyObjects simultaneously, even if any particular object is only accessed by one thread.

I think you avoid GIL stuff if you can guarantee:
- your threads will synchronize themselves to avoid simultaneously trying to run Python code
- no threads are ever created from Python

The interpreter would be running in its single-threaded mode where the GIL doesn't actually exist and there's only one possible value of its global ThreadState structure in play. CPython doesn't use any OS thread-local variables, so it shouldn't care which OS thread is actually running its code, as long as there's only one at a time.

Foxfire_
Nov 8, 2010

creatine posted:

Does anybody know of a python3 module that lets you work with .FCS files? All the ones I can find are 2.7 only

https://github.com/eyurtsev/fcsparser claims to do it, I've only ever used FlowCytometryTools (which is 2.7 only)

Depending on what exactly you need, it's not too horrible to write something yourself to yank out the data and compensation matrix.

Foxfire_
Nov 8, 2010

pudb's my favorite if you're on unix

Foxfire_
Nov 8, 2010

salisbury shake posted:

I'm just learning to use NumPy and had to poke around the docs to do what I want. While the above code finds the solution correctly, I'm unsure if I'm using NumPy efficiently or even canonically.

For example, in get_fabric(), I'm trying to minimize sequential Python code in favor of using NumPy's ufuncs and routines where possible, but I'm still calling np.add.at() over a thousand times in total. That feels wrong, but I don't have the context to really know if/why it's wrong.

Outside of a code-golf type fun challenge, there's not really a reason to vectorize this kind of thing. It will make it less readable, slower, and more memory hungry than a loop version.

The problem is naturally solvable and readable with a loop, and loops are what computers are good at. The problem with doing a naive loop is that Python is very, very slow.

To get your math to run at a reasonable speed, you need to get the computation out of Python. You can do that by using NumPy ufuncs so that the only bit of Python code that runs is invoking the ufunc a few times and having each ufunc [which is implemented in C] do lots of math. But that means you have to figure out a way to vectorize instead of doing it the natural way.

An alternative is to use numba. It provides a decorator that you can apply to your Python function that says "instead of executing this function in the Python interpreter, just-in-time compile it into normal instructions the first time it is used and automatically generate the boilerplate to pass data back and forth". It can only work with a subset of the Python language, but it's often good enough.

Taking your get_fabric function and massaging it a bit (I don't remember offhand if namedtuples are supported in numba):

code:
import numba
import numpy as np

@numba.jit(nopython=True)
def get_fabric(claims, length):
    fabric = np.zeros((length, length), dtype=np.int8)

    # add claims to fabric
    for claim_row in range(claims.shape[0]):
        claim = claims[claim_row, :]
        x = claim[0]
        y = claim[1]
        width = claim[2]
        height = claim[3]

        fabric[y:y+height, x:x+width] += 1
    
    return fabric
will be efficient.

Bonus thoughts for refining your solution:
- With your solution, what happens if you have a million elves on a 10x10 fabric?
- Can you avoid having to allocate the length x length array? i.e. can you make it work if the fabric is 10000000 x 10000000 as long as the claim list is short?

Foxfire_
Nov 8, 2010

You can look at the dictionaries themselves and see where the entries are:

code:
class Class:
    """ I'm a docstring! """
    def __init__(self, name):
        self.name = name
    name = "Whee"

test = Class("Foo")
test2 = Class("Bar")

print("Class's dict:")
for key, value in Class.__dict__.items():
    print("{0:>15}: {1}".format(key, value))
print()
print()
print("test's dict:")
for key, value in test.__dict__.items():
    print("{0:>15}: {1}".format(key,value))
print()
print()
print("test2's dict:")
for key, value in test2.__dict__.items():
    print("{0:>15}: {1}".format(key,value))
code:
Class's dict:
     __module__: __main__
    __weakref__: <attribute '__weakref__' of 'Class' objects>
       __dict__: <attribute '__dict__' of 'Class' objects>
       __init__: <function Class.__init__ at 0x7f76739a9158>
           name: Whee
        __doc__:  I'm a docstring! 


test's dict:
           name: Foo


test2's dict:
           name: Bar
When you access thing.whatever, it searches for an entry for "whatever" in thing.__dict__, then searches thing.__class__.__dict__ if it didn't find anything*


*this is a little bit of a lie. There are some other names that live somewhere else. That's why you don't see an entry for "__class__" in test and test2's dicts even though you can access "test.__class__" without getting an error. And there are a few more places that get searched for base classes.

Foxfire_
Nov 8, 2010

Installing opencv is kind of a clusterfuck and I wouldn't really expect pip-ing it to work. From vague memories of the last time I did that, pip-ing it only installs the python wrapper, not opencv itself.

cv2.pyd is the dll with the python wrapper. It implicitly links to all the actual opencv dlls. Probably one or more of those is missing or not locatable.

Take Dependency Walker and open cv2.pyd with it. It will show you everything cv2.pyd tries to load, everything its immediate dependencies try to load, etc... Every implicit load has to be satisfied or cv2.pyd will fail to load. Delay-loaded ones don't have to be satisfied (delay-load = the program tries to LoadLibrary() in some code path, which (A) may never actually be executed and (B) can handle LoadLibrary() returning an error).

Foxfire_
Nov 8, 2010

Write it back out to disk as a big binary array, memmap it, and let the OS swap it in and out as needed?

Foxfire_
Nov 8, 2010

SirPablo posted:

Not sure I follow this. I tried a memmap but seemed to just hose my machine.

If you try to read 50GB into memory, you'll get an out-of-memory error since you don't have that much RAM (or if you have 50GB+ of swap enabled, you'll effectively get crappy memmapped behavior).

Memory mapping it is telling the OS "When I try to access something in this region of memory, go read the surrounding little area from disk, pretend it was already in memory, and cache it so that if I access it again you don't have to go all the way to disk". If you touch a new area and there's no free RAM to cache it, some other disk-backed page will get kicked out of RAM according to the OS's policies, and the next program that touches that page will have to wait while it reads back in from disk (and boots something else out).

If you're accessing the 50GB in a random unpredictable way (so that every read/write is likely to have to be loaded from disk and boot some other page to make space), there's not really anything you can do to make it run well besides get more RAM. But if the access pattern is sequential or has locality like most do, accesses will tend to be to regions that are already cached.
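If it helps, a minimal np.memmap sketch (file name, shape, and dtype are invented; swap in whatever your 50GB actually looks like):

Python code:
import numpy as np

# Create the on-disk array once (mode="w+" creates/overwrites the file)
shape = (50_000, 125_000)   # ~50GB of float64
data = np.memmap("big_array.dat", dtype=np.float64, mode="w+", shape=shape)
data[0, :] = 1.0            # only the pages you touch ever occupy RAM
data.flush()

# Later, reopen read-only and work on slices. Sequential/local access stays
# fast; truly random access over the whole file will thrash the page cache.
view = np.memmap("big_array.dat", dtype=np.float64, mode="r", shape=shape)
chunk_mean = view[:1000].mean()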

Foxfire_
Nov 8, 2010

Sad Panda posted:

Poorly explained. I'll try again.

I've got some Python code. It uses Pillow, OpenCV and PyAutoGUI to do some clicking based on images. At the moment, it is running on an ESXi Windows 10 server that I connect to from a Macbook using Microsoft Remote Desktop. If the Remote Desktop window is open (on my Mac) then it plays wonderfully. If the Remote Desktop window is closed, it fails and throws the OSError that I posted earlier. I assume there is some check that is run that tells Windows that there's not really a display that it is outputting to.

I had a quick look at PyVirtualDisplay and it seems that it is *nix only.

It's been a while since I've had to care, but if I'm remembering right, when you drop your remote desktop session, that desktop stops existing and all the OS resources that were allocated for it get freed.

A desktop is a thing in Windows' object model and participates in a bunch of stuff for both security and resources. Example: the administrator password prompt runs on its own desktop. You can't send or receive messages to/from windows on it from the default interactive desktop, they don't share clipboards, etc...

I think video memory will also be deallocated, so probably your immediate problem is that there isn't any buffer to screen grab from anymore. When you log back in, memory gets allocated again and all windows get told to repaint themselves.

Foxfire_
Nov 8, 2010

You won't do better than the C Windows API way of doing it, since that's underneath everything Python does [and Python's underlying thread scheduling is terrible]. That will still give ~20ms of jitter if other things want the CPU and don't yield sooner, so if that is too much, you'll need to do something else.

There you would do it by using:

- CreateWaitableTimer to create a timer object
- SetWaitableTimer to configure it as periodic with some beat
- WaitForSingleObject to sleep until the timer expires

You get bonus priority from the scheduler because you're waiting on IO.

You could ctypes it from python. I don't think there's a standard-library way to sleep for anything but a relative duration, so otherwise the best you could do is calculate an approximate wake time and do a relative sleep (then either spin until it's time or accept whatever inaccuracy is left).
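A rough, untested ctypes sketch of that waitable-timer approach (argument conventions per the Win32 docs; treat it as a starting point rather than production code):

Python code:
import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.CreateWaitableTimerW.restype = wintypes.HANDLE
kernel32.SetWaitableTimer.argtypes = [
    wintypes.HANDLE, ctypes.POINTER(ctypes.c_longlong), wintypes.LONG,
    ctypes.c_void_p, ctypes.c_void_p, wintypes.BOOL,
]
kernel32.WaitForSingleObject.argtypes = [wintypes.HANDLE, wintypes.DWORD]
kernel32.CancelWaitableTimer.argtypes = [wintypes.HANDLE]
kernel32.CloseHandle.argtypes = [wintypes.HANDLE]

timer = kernel32.CreateWaitableTimerW(None, False, None)  # auto-reset, unnamed

# Due time is a LARGE_INTEGER in 100ns units (negative = relative); period is in ms
due = ctypes.c_longlong(-10_000_000)   # first fire after 1 second
kernel32.SetWaitableTimer(timer, ctypes.byref(due), 100, None, None, False)

INFINITE = 0xFFFFFFFF
for _ in range(10):
    kernel32.WaitForSingleObject(timer, INFINITE)   # sleep until the beat
    print("tick")                                   # periodic work goes here

kernel32.CancelWaitableTimer(timer)
kernel32.CloseHandle(timer)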

But if you have a microcontroller, just use that. It is much better at doing things with consistent timing than a big computer.

Foxfire_
Nov 8, 2010

TwystNeko posted:

Okay, so I've got this ESP32 working rather well with Micropython, controlling a MAX7219-based LED matrix, and capacitive touch sensing for buttons. However, I need some help with interrupts, I think.

To prevent this being an X/Y problem, my end goal is to display an "idle" animation on the matrix, and when the touch sensor is triggered, display a different image in increments of 15 seconds, then go back to the idle animation.

The nice thing about using the MAX7219 is that it "holds" the image displayed until specified otherwise, so I can just sleep the ESP32 for however long. What's the 'best' way to do this?

This is what I've got right now:

code:
import max7219, framebuf, time
from machine import Pin, SPI, TouchPad
spi = SPI(1, baudrate=10000000, polarity=1, phase=0, sck=Pin(4), mosi=Pin(2))
ss = Pin(5, Pin.OUT)
d = max7219.Matrix8x8(spi, ss, 2)
t = TouchPad(Pin(14))

def loadbmp(bmp):
    with open(bmp, 'rb') as f:
        f.readline()
        f.readline()
        f.readline()
        data = bytearray(f.read())
    fbuf = framebuf.FrameBuffer(data, 8,8, framebuf.MONO_HLSB)
    return fbuf

while True:
    if t.read() < 450:
        d.blit(loadbmp('leftarrow.pbm'),0,0)
        d.show()
        time.sleep_ms(5000)
    else:
        d.blit(loadbmp('sq1.pbm'), 0,0)
        d.show()
        time.sleep_ms(1000)
        d.blit(loadbmp('sq2.pbm'), 0,0)
        d.show()
        time.sleep_ms(1000)

It works, but if it's triggered during the animation it has to wait up to 2 seconds. TBF, it's turn signals for a bicycle, so it doesn't have to be instant.

Set up two ISRs:
- On the GPIO activating, set a flag then return.
- On the timer expiring, set a different flag then return (not really needed since you can infer it from the first one)

Forever:
- Update the display with the picture for this state
- Load the timer with the time till the next frame and start it
- Sleep the chip till there's an interrupt
- Mask interrupts
- Disable timer
- Copy global flags to locals and clear global ones
- Unmask interrupts
- Decide new state based on whether it was timer or button that woke you up

Depending on hardware, you might need to do something fancier to debounce the GPIO

Foxfire_
Nov 8, 2010

You're writing to stdout and it's line buffered by default. It won't flush to the OS unless a buffer fills up or you write a newline.

use print(whatever, flush=True)

Foxfire_
Nov 8, 2010

Windows also accepts either direction

Foxfire_
Nov 8, 2010

That looks like correct behavior?

On windows, the paths "..\this\that.txt" and "../this/that.txt" both mean "up one directory, into a directory named this, file name that.txt"
On linux, the path "../this/that.txt" means that, but the path "..\this\that.txt" means "a file named ..\this\that.txt in the current directory". ..\this\that.txt is a legal posix filename because posix filenames are terrible (newlines and nonprinting UTF-8 characters are also legal in posix filenames)

If you want to make that be interpreted as a path with directories on linux, you will need to either explicitly tell it to interpret it as a windows path or manipulate the string into a posix path
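pathlib's pure path classes can do the explicit interpretation for you:

Python code:
from pathlib import PurePosixPath, PureWindowsPath

# On posix, the backslashy string is one opaque filename:
PurePosixPath(r"..\this\that.txt").parts     # ('..\\this\\that.txt',)

# Explicitly interpreting it as a Windows path recovers the components:
PureWindowsPath(r"..\this\that.txt").parts   # ('..', 'this', 'that.txt')

# ...which can then be rebuilt as a posix path if that's what you want:
PurePosixPath(*PureWindowsPath(r"..\this\that.txt").parts)   # PurePosixPath('../this/that.txt')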

Foxfire_ fucked around with this message at 20:08 on Oct 13, 2020

Foxfire_
Nov 8, 2010

If you're composing the paths, just always use forward slashes since windows will also accept that.
If you're accepting the paths from users, you'll need to decide how to interpret what they mean when they enter "this\that" on linux, since it technically means a single filename but they probably didn't mean that if they are good upstanding people

Foxfire_
Nov 8, 2010

That sounds large enough, and performance-sensitive enough, that using python is a bad idea. Python is very, very slow. If you're doing large numerical stuff, you can really only use it as a glue language to connect together pieces implemented in something else. Backing it with rust or C, or writing in a python-adjacent language like Cython or numba, seem like better ideas to me.

How many times are you going to use this thing and how many points are you going to classify? If it's not that many, a naive iteration through all points for every lookup in not-python might be the fastest thing considering your time to implement it. 20,000 points is not that many. Like making up some numbers and calling each point 100 doubles long, that's only 16MB of memory to go through per classification.

Foxfire_
Nov 8, 2010

The original reason for Python not having static types is that it makes writing the interpreter easier. In the interpreter, every single python object is the same type (PyObject) and supports exactly the same set of operations & data. Python grew out of a one-man hobby project and lots of its warts stem from that. Its predecessor language (ABC) did have a static type checker (and so did Algol, FORTRAN and C, all of which are much older)

Dynamic types also make objects changeable at runtime, which is useful for modifying stuff for tests or messing with other people's code. Every function/attribute access is internally "look up a value in a map of string -> PyObject, throw AttributeError if that key doesn't exist", so you can trivially rewire those mappings at runtime. Also, people like having less stuff to type when working interactively or doing little shell-script-type things, and those were python's main target area for most of its history. Guido van Rossum was resistant to adding type hints up until he moved from academia to having to maintain large codebases.

Foxfire_
Nov 8, 2010

salisbury shake posted:

The above is not only slow, it maxes out my CPU at only something like 150MB/s.

I would not be surprised if you are just running into python being slow as a limit. You are doing:
- Lookup what 'next' means on the stdin.buffer object
- Call it, it reads data into an internal buffer
- Allocate a new python object and copy the bytes into it (I think you are at least skipping the bytes -> UTF-8 conversions the way you are doing it)
- Lookup what 'write' means
- Call it, copy the data into another internal buffer (occasionally flush that buffer to the OS)
- Deallocate the python object
- Loop


The shell pipe redirections are just renaming things so that the output of the first and the input of the second are the same thing. You can't introduce yourself in the middle of that because there is no middle, they're one file with two names.

If you want to insert your program into a pipeline and don't care about anything besides counting, the fastest way will be to use os.open() / os.read() / os.write() with a decent-sized buffer to skip as much python as possible. You'll still be allocating & deallocating python objects, copying data, and doing dictionary lookups for every chunk though, so it'll be slower than a C version.
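Something like this, counting newlines as a stand-in for whatever you're actually tallying (the buffer size is a guess to tune):

Python code:
import os
import sys

count = 0
fd = sys.stdin.fileno()
while True:
    chunk = os.read(fd, 1 << 20)     # up to 1 MiB of raw bytes per syscall
    if not chunk:                    # empty read means EOF
        break
    count += chunk.count(b"\n")      # bytes.count runs in C, not python
print(count)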

Mirconium posted:

For multiprocessing pools, should I be spawning one any time I want to async-map over an iterator, or can I just create a single pool for the arena that I am working with and then summon it by reference any time I need to go over an iterator? I'm not planning to actually asynchronously run anything, I just want to parallelize some iterative computations.

Be aware that by default, python multiprocessing on unix does invalid things. It will fork() to create copies of the existing process, then try to use them as pool workers, which violates POSIX. It will usually happen to work for most common libc implementations as long as absolutely everything in the process is single-threaded. multiprocessing.set_start_method('spawn') will fix it.

Foxfire_
Nov 8, 2010

Tayter Swift posted:

I have a 40 GB CSV, about 50 million records by 90-ish fields. I need to sort it by fields VIN and Date, and remove duplicated VINs by most recent Date. My machine has 64GB.

What's a sensible way to accomplish the task? After compacting categories in dask, saving it as parquet and re-reading it into a pandas df it's about 21GB, but it still can't be easily manipulated in pandas, and I know these are not dask-friendly parallelizable operations.

Turn it into a numpy record array, then do the sorting via numpy. If the de-duplication is not trivial to do via numpy interface, write it in numba so you can do a natural iterative thing instead of trying to hammer it into array operations with many temporary copies. If still too big for memory, np.memmap() is easy and won't be that much slower.
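A small sketch of the numpy side (field names and dtypes are invented; the idea is a lexicographic sort on (VIN, Date), then keeping the last row of each VIN run):

Python code:
import numpy as np

dtype = [("vin", "U17"), ("date", "datetime64[D]"), ("price", "f8")]
records = np.array(
    [("1A", "2021-01-02", 1.0),
     ("1A", "2021-03-05", 2.0),
     ("2B", "2020-12-31", 3.0)],
    dtype=dtype,
)

records.sort(order=["vin", "date"])   # in-place lexicographic sort

# After sorting, the most recent row for each VIN is the last row of its run
last_of_run = np.append(records["vin"][1:] != records["vin"][:-1], True)
deduped = records[last_of_run]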

e: Type hint looks correct to me, except that it should be Optional[Union[int, range, Iterable[Union[int, range]]]] if you're allowing None like the default suggests. Not surprising that tooling doesn't understand complicated things

I read it as foo is one of:
- None
- int
- range
- Iterable where each entry is either int or range

Foxfire_ fucked around with this message at 23:10 on Feb 22, 2021

Foxfire_
Nov 8, 2010

OnceIWasAnOstrich posted:

edit: I'm having a hard time figuring out whether the PEP/mypy actually allow for covariant type lists. I think maybe the type system doesn't allow that for mutable types? I think maybe PyCharm sees list, the type system specifies lists can only be invariant, so it sees a List[int], converts to Iterable[int] and calls it a day?

I don't think it's thought through enough to actually have specified behavior outside of whatever some particular tool assumes. Is Optional[int] a type or a statement about two different types? No actual object will ever have a type of Optional[int]. Some names might variously refer to None or an int; should the name get a 'type' (distinct from the object type) that somehow logically gets attached to it? From CPython's point of view, none of the type hints have anything to do with actual types, so there's nothing there to guide anything.

PEP484 has
code:
T = TypeVar('T', int, float, complex)
Vector = Iterable[Tuple[T, T]]
in one of its examples, so that at least didn't think there was anything wrong with having non-homogeneous lists

e: That's maybe a bad example since you could read it as requiring every tuple in the iterable to have the same number type. List[Any] does definitely show up in the PEP though

Foxfire_ fucked around with this message at 00:31 on Feb 23, 2021

Foxfire_
Nov 8, 2010

CarForumPoster posted:

Yea a db is the way to go though I also need to do fuzzy string matching which is a whole other issue but I suppose if I wanna do that I can query all 8M rows of 1 or 2 cols into memory then do my fuzzy matching in python slow as loving can be. Maybe make an AWS Lambda function to parallelize a bunch of these but have the

Your underlying problem is more that pandas in general is trying to optimize for person-time writing code at the expense of being very slow and using lots of RAM. Generally that's a good trade-off, but not here, where you have too much stuff and are going to run it a lot. pandas is also horrible at strings since the underlying things actually being stored are arrays of PyObject pointers to full python objects elsewhere.

8,000,000 rows x 50 cols isn't that much data. Like if each string is 64 bytes, that's still only about 25GB, which is only going to take a second or two to go through if they're already in RAM.

I would:
- Move data to plain numpy arrays of string_ dtype (fixed length, not python objects)
- Do the fuzzy string match in numba. Levenshtein distance implementations are easily googleable
Ought to take less than a second to compute distance between the query and every element in a column

Wouldn't use a database since you won't be able to do the fuzzy matching on the database side, and you want to avoid having to construct python objects or run python code per element.
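For example, a plain dynamic-programming Levenshtein kernel that numba can compile, operating on raw byte arrays so no python strings are involved (a sketch, not tuned):

Python code:
import numba
import numpy as np

@numba.njit(cache=True)
def levenshtein(a, b):
    # a and b are 1-D uint8 arrays (the raw bytes of the two strings)
    n, m = a.size, b.size
    d = np.empty((n + 1, m + 1), dtype=np.int64)
    for i in range(n + 1):
        d[i, 0] = i
    for j in range(m + 1):
        d[0, j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return d[n, m]

# Usage sketch: compare a query against a fixed-width bytes column.
# (For full speed, do this loop inside another @njit function too.)
column = np.array([b"smith", b"smyth", b"jones"], dtype="S64")
query = np.frombuffer(b"smith", dtype=np.uint8)
dists = [levenshtein(np.frombuffer(s, dtype=np.uint8), query) for s in column]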

Foxfire_
Nov 8, 2010

Protocol7 posted:

You know, I do a lot of stuff with Pandas, but after reading some of the recent posts in this thread... :ohdear:

It is fine and good if the problem you are trying to solve matches up with what it is designed to do. It is a good tool for interactively viewing/manipulating small amounts of data. It is not a good tool for manipulating multi-GB datasets or doing things you're going to run a million times where you care about performance. If you're only going to run something once and it still only takes a few seconds, doing it 100X slower is a good trade for 5 minutes of your time.


Protocol7 posted:

For example, did you know if you are copying items from one spreadsheet to another, that you need to specify pandas.set_option('display.max_colwidth', SOME_LARGE_VALUE) or else it will just silently truncate strings longer than 100 characters by default? Because I sure didn't.
Are you copy/pasting its display output somewhere? The REPL output is intended for humans, not tools, and humans usually don't want to see giant strings in their tables, so that's not unexpected. If you're trying to move data somewhere, tell it to write csv or excel or something (unless you are doing that and it still truncates, in which case, eww)

Foxfire_
Nov 8, 2010

Jose Cuervo posted:

I am on board with doing this, but I am having trouble wrapping my head around how I would accomplish this. Do I define a new column, say pid as the primary_key for the Patient model as below, and use pid as the foreign_key in the other models as below for the Diabetic model?

However, if I do this, then I need to keep a mapping between the MRN and the pid outside of the database, right? So now I have the MRN in two different places...

You have a meaningless-outside-your-application patient ID that uniquely identifies a patient. All of your deidentified tables reference that for knowing which patient a row is about. Then you have a table that maps from the patient ID to all the PHI about that patient (MRN, names, DOB, ...). That table could be either inside the same database or elsewhere if you wanted to segregate the dataset into PHI and deidentified parts.

You might also need multiple PHI tables for some stuff, like if you want to store multiple different-system/different-site MRNs for one actual person. For each table in the schema, you should be able to classify it as PHI or not and shouldn't be mixing PHI and not-PHI together in the same table. If you've done stuff right, you ought to be able to hand someone all your non-PHI tables and they'd be able to follow information about a particular subject, but not have any PHI about them

Foxfire_
Nov 8, 2010


What you're trying to do seems generally right to me. I would toss in a unique constraint on pid if you want to use (MRN, HealthSystem) as the primary key. Also, MRNs should probably be strings, not integers: "012345" and "12345" are probably distinct MRNs.

For the autoincrement, like Wallet said, SQLite has weird autoincrement behavior and specific workarounds in sqlalchemy. Since you don't particularly care that the values are ascending integers as long as they're unique, a reasonable portable thing would be to externally make a UUID and use that. Or you could beat sqlite into giving you a unique integer, but you'll have something somewhat db engine dependent.
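A rough sketch of that split in SQLAlchemy declarative terms (assuming SQLAlchemy 1.4+; table and column names are invented, adapt them to your models):

Python code:
import uuid
from sqlalchemy import Column, Date, ForeignKey, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()

def new_pid():
    return str(uuid.uuid4())   # portable unique ID, no autoincrement quirks

class Patient(Base):
    __tablename__ = "patient"
    pid = Column(String(36), primary_key=True, default=new_pid)

class PatientPHI(Base):
    __tablename__ = "patient_phi"   # the only table holding PHI
    pid = Column(String(36), ForeignKey("patient.pid"), primary_key=True)
    mrn = Column(String, nullable=False)            # string: "012345" != "12345"
    health_system = Column(String, nullable=False)
    dob = Column(Date)
    __table_args__ = (UniqueConstraint("mrn", "health_system"),)

class Diabetic(Base):
    __tablename__ = "diabetic"      # deidentified clinical table
    pid = Column(String(36), ForeignKey("patient.pid"), primary_key=True)
    # ...clinical columns, no PHI...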

e: also be aware that foreign key constraints in SQLite by default do nothing.

Foxfire_ fucked around with this message at 05:07 on Mar 25, 2021

Foxfire_
Nov 8, 2010

OnceIWasAnOstrich posted:

You can create a multiprocessing.Process with the target= argument as the callable you want to run, ideally using a Pipe or Queue to feed your data back to the main process. You then call .start() and then .join(timeout) with whatever timeout you want before checking the Pipe/Queue and calling it a day. If your underlying callable releases GIL you can even do this with threading.Thread instead.
Be aware that by default, multiprocessing on Linux is broken and will randomly deadlock. You need to change the start method from fork to spawn. The MacOS and Windows defaults are already correct.

Its default behavior is to fork() to make a copy of the process, then keep using it without calling exec(). This has always been forbidden by POSIX, but is commonly done. It will generally work fine as long as absolutely nothing in the process holds any locks at the moment the fork occurs, which you have no way of assuring in general (you can't easily prove that some library doesn't make a thread that it uses internally).
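A minimal sketch of that pattern with the start method forced to 'spawn' (the worker body is a stand-in for the real work):

Python code:
import multiprocessing as mp

def worker(q):
    q.put(sum(range(10_000_000)))   # stand-in for the real work

if __name__ == "__main__":
    mp.set_start_method("spawn")    # avoid the broken fork default on Linux
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    p.join(timeout=5.0)
    if p.is_alive():                # still running after the timeout
        p.terminate()
        p.join()
        print("timed out")
    else:
        print("result:", q.get())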

Foxfire_
Nov 8, 2010

That doesn't actually kill the process, it just stops waiting for the result:
Python code:
import concurrent.futures
import time

def forever():
    while True:
        time.sleep(1.0)
        print("I'm not dead yet")

def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        try:
            future = executor.submit(forever)
            time.sleep(5.0)
            future.cancel()
            return 0

        except concurrent.futures._base.TimeoutError:
            raise RuntimeError("Timeout")

        finally:
            print("Shutting down ProcessPool...")
            executor.shutdown(wait=False, cancel_futures=True)


if __name__ == '__main__':
    print(main())
I get
code:
(py392) D:\trash>python stuff.py
I'm not dead yet
I'm not dead yet
I'm not dead yet
I'm not dead yet
Shutting down ProcessPool...
0
I'm not dead yet
I'm not dead yet
I'm not dead yet
I'm not dead yet (repeats forever)

mr_package posted:

I was also surprised that you basically cannot easily kill threads/processes that you launch. There are third-party libraries and what appear-- to me-- to be very hack-y approaches to dealing with this to be found on SO. Perhaps they are fine, perhaps this is just the way it is done. This is my first look at concurrency, do other languages handle this pretty much the same way or is it Python-specific? I'm thinking for example if you were writing something in go or rust or C# or whatever and you wanted to terminate a process early, is it easier?
You may never safely abruptly kill a thread in any language. If you want the ability to cancel things, you need to make your code voluntarily die. The Win32 syscall that does it (TerminateThread), for example, is full of nasty warnings that basically boil down to "never ever call this". The pthreads one (pthread_kill) has fewer warnings in its docs, but isn't any safer to use.

It is possible to safely abruptly kill a process if you have appropriate permissions, but it's not something I would expect a processing pool library to make easy. Again, your code should have a way to gracefully cancel (returning the worker to the pool), not forcefully rip down a piece of the pool infrastructure and hope the pool implementation heals.

Foxfire_
Nov 8, 2010

mr_package posted:

What does the correct way to do this look like? I have a piece of code I know might take too long. How do I set up the call to it so that it terminates in an appropriate time? This is not a standard thing?

Or, is it that the core issue is socket.gethostbyname() has a 5s timeout and I need to respect that? In other words, shoehorning control of that process into a MP library is the wrong approach: I should not use socket.gethostbyname() at all if the timeout behavior is wrong for my use case.. For example, I could write my own (what could go wrong?) or use another library such as https://www.dnspython.org/ that has a timeout and kills itself rather than my trying to kill it if it takes too long. Is that the way this type of thing should be done?
You can't kill it if it's in the same process as you, because it is potentially using global stuff you would also like to use (e.g. if it's in the middle of allocating memory, it may be holding a mutex for the C runtime heap allocator). Ideally, functions would give you a way to tell them to cancel themselves so they exit nicely.

For lots of stuff that wasn't ever intended to be used asynchronously, there won't be a way to do that. gethostbyname() doesn't have a way to cancel it. What you can do is make the code that launched it not wait anymore & make it so that once the slow call does return, your code ignores the result and just exits:

Python code:
import socket

def do_the_thing():
    # am_i_canceled() and do_stuff_with_address() are placeholders for your own
    # cancellation signal and follow-up work
    address = socket.gethostbyname("dick.butt")
    if am_i_canceled():
        return

    do_stuff_with_address(address)
If your calling code does something to make am_i_canceled() return true, then whenever the gethostbyname call does return, the function will exit at that point. It's not really 'canceled' in terms of an abnormal execution path; you're just adding a signaling mechanism and an execution path that exits earlier than it would otherwise

mr_package posted:

As an aside, does anyone know offhand what happens when you call subprocess.run() with a timeout? Does the process continue to run in the background and then exit normally however long it takes? I'm wondering if the caller just ignores it and moves on or if it actually communicates to the process that it should quit/cancel (presumably the process would be responsible for then handling this in a mostly graceful way).
You can look at the CPython implementation to see what it does (https://github.com/python/cpython/blob/main/Lib/subprocess.py). On timeout, it calls the python function kill() on the python process object. The documentation for that says that it sends a SIGKILL on Linux/Mac and calls TerminateProcess() on Windows. Neither is ignorable, the operating system will abruptly kill that victim process (which is safe to do* because it doesn't share any state with the parent process. (*unless you specifically set up an inter-process communication method, in which case it would no longer be safe))

Foxfire_
Nov 8, 2010

The first lesson of using Python on biggish datasets is "don't". It's slow and python objects have huge amounts of overhead.

Only use python as a glue language to manipulate non-PyObject things with not-python implemented code


(this is an old i5-2500K and the dataset is only ~4GB)

e: If you want a python list of python integers at the end, it's O(n) to go back to that. It'll get enormous though, because each 4-byte integer is going to turn into a full PyObject. Doing the 'ndarray of a billion integers' -> 'python list of a billion python integers' conversion is slower than generating and sorting the ndarray.

Foxfire_ fucked around with this message at 05:39 on Sep 10, 2021

Foxfire_
Nov 8, 2010

QuarkJets posted:

That code is pretty hard to read

I decided to reimplement this problem in very simple terms: 10 GB of data moved around on a single NVMe drive

t = 33.7 s for single process
t = 33.5 s for multi process walk + transfer
t = 30.2 s for multi process transfer only

Why isn't the performance any better? Probably hardware reasons, I'd guess. I believe that the *nix rsync utility is a single threaded application for this reason
Magnetic disk is much slower than solid state disk
Solid state disk is much slower than main RAM
Main RAM is much slower than L3 cache
L3 cache is slower than L2 cache
L2 is slower than L1
L1 is slower than CPU registers

Copying a file is essentially no CPU work if the underlying copy is implemented sanely. The CPU is telling the disk controller "Copy this sector of data to main memory, interrupt me when you're done", taking a nap, telling the disk controller to copy in the other direction and napping again, then repeating that till everything's copied. Having more or faster CPUs won't help because they're napping 99% of the time anyway

It's like you've got one guy with a shovel and ten people standing around telling him where to dig next. It's not faster than one guy with the shovel and one boss.


e: also, unrelatedly, you have to do import multiprocessing; multiprocessing.set_start_method('spawn') on Unix to get not-broken behavior. The default ('fork') violates fork()'s specification and may randomly deadlock depending on what other code in your process is doing.

Foxfire_ fucked around with this message at 04:36 on Oct 5, 2021

Foxfire_
Nov 8, 2010

There is no expectation that anything pickled can be unpickled after the version of anything changes. It might work, might give you an error, and might silently give you corrupted data. Pickles are not suitable as a nontransient serialization format.

Foxfire_
Nov 8, 2010

The pickle protocol may be backwards compatible (that's been broken a couple of times in Python's history, but at least those were considered bugs), but your likelihood of getting usable data back from a pickle that's sat somewhere while a couple of years of people and documentation turnover went by isn't that great, unless you're doing something like building a manual "convert to/from a dict of language primitives" pre-serialization step. And at that point you might as well be using either a self-describing serialization format, or a binary format that's less 'do everything' but better at the things your application specifically cares about.

(mostly I'm bitter about scikit having no serialization format besides pickles and having to deal with people's broken files from some unknown past version)

Foxfire_
Nov 8, 2010

D34THROW posted:

I was today years old when I discovered float.as_integer_ratio.
I'm having trouble thinking of something that it's actually useful for. Display code would want some application-specific approximation instead of the exact values (humans usually don't want to be told that "0.1" is 3602879701896397 / 36028797018963968 instead of 1/10). And if you were doing arbitrary-precision math, you shouldn't have floats to begin with
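For reference, what it actually returns:

Python code:
>>> (0.1).as_integer_ratio()
(3602879701896397, 36028797018963968)
>>> (0.125).as_integer_ratio()   # exactly representable, so the ratio is tidy
(1, 8)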

Foxfire_
Nov 8, 2010

D34THROW posted:

Fractional inches.
I guess those happen to work out since all the values you care about happen to be exactly representable fractions with power-of-two denominators.

For the cut list, a straightforward debugging method is to print out stuff and see where what actually happens splits from what you think happens:
code:
def calculate_required(item: str, cutlist: list) -> int:
     size = MATERIAL_LENGTHS[item] if item in MATERIAL_LENGTHS.keys() else 288
     remain = [size]
     solution = [[]]
     for piece in sorted(cutlist, reverse=True):
         print(f"Fitting piece {piece} into free space {remain}")
         for num, free in enumerate(remain):
             print(f"Considering free chunk of length {free}")
             if free >= piece:
                 print(f"It fits, leftover freechunk is length {remain[num]-piece}")
                 remain[num] -= piece
                 solution[num].append(piece)
                 break
             else:
                 print("It doesn't fit in this chunk")
                 print("(Why is it moving into the solution and the uncut lengths changing)")
                 solution.append([piece])
                 remain.append(size-piece)
     return len(solution)
Python list iterators internally happen to be implemented by tracking indices. The infinite loop happens when it considers a length, appends it to remain, then the next loop considers that length again.
The else path doesn't make sense to me glancing through: the piece doesn't fit into this free chunk, but the code is messing with the solution and the remaining free space anyway.

Foxfire_
Nov 8, 2010

I'm not 100% sure I'm following what you're doing, but I think you have each timespan starting out on its own and you keep merging them until the gaps are bigger than 10mins?

Disjoint set forest is a data structure that can do that efficiently. For N things, they are grouped into sets with each object belonging to exactly one set. Finding which set something belongs to and merging sets is amortized ~O(1)

(But depending on your problem size & how often you will run it, it may not be worthwhile to implement & test it vs. just waiting out a trivial python version, doing a trivial implementation in some faster non-python thing (either python-adjacent numba/numpy, or a completely separate language), or finding someone who's already implemented the data structure in python.)
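If you do go that route, the data structure itself is small. A plain-python sketch with path compression and union by size (call union(i, j) for each pair of spans whose gap is under 10 minutes, then find() tells you which merged group a span ended up in):

Python code:
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:            # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra                     # attach smaller tree under larger
        self.size[ra] += self.size[rb]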

Foxfire_
Nov 8, 2010

Do you intend to count a line like

"yellow yellow"

As one yellow count? Also, do you intend it to be case specific?

Foxfire_
Nov 8, 2010

Is it the choices call that's slow or the rest of it?

Non-python code (numpy.random, and numpy generally) will run much faster. The typical way to do numerical computing with python is to use python as a glue language connecting non-python code (numba or numpy)

Foxfire_
Nov 8, 2010

Can you set up some smaller dummy thing that duplicates the problem?

When I run this:
Python code:
# Make a bigish dict indexed by a tuple of strings
probs = {}
for i in range(1000):
    probs[(f"thing1_{i}", f"thing2_{i}", f"thing3_{i}")] = i

def profile():
    """ The code we're interested in timing """
    fake_work = 0
    for _ in range(200_000):
        # Dict lookup, pretty sure CPython doesn't cache anything so every call does the full lookup
        fake_work += probs[("thing1_300", "thing2_300", "thing3_300")]
    return fake_work
%timeit profile() is telling me it's only tens of ms to run profile(). Doesn't seem like the dict lookup should matter much for runtime, unless your probs is a lot bigger than 1000 things


Foxfire_
Nov 8, 2010

A less dramatic rearranging. This is trading memory for time by getting rid of python objects and python computation.


1x python saying "Numpy, please generate me 400 choices using these probabilities" is much faster than 400x python saying "Numpy/random.choices, please generate me 1 choice using these probabilities".

Python code:
In [86]: def rearrange(num_trials_per_key, include_py_lookup):
    ...:     probs = {}
    ...:     keys = []
    ...:
    ...:     for i in range(5000):
    ...:         keys.append((f"condition1_{i}", f"condition2_{i}", f"condition3_{i}"))
    ...:         probs[(f"condition1_{i}", f"condition2_{i}", f"condition3_{i}")] = np.array([0.25, 0.25, 0.25, 0.25])
    ...:
    ...:     possible_options = ['option 1', 'option 2', 'option 3', 'option 4']
    ...:
    ...:     for key in keys:
    ...:         probabilities = probs[key]
    ...:         chosen_option_indices = np.random.choice(
    ...:             a=np.arange(len(possible_options)),
    ...:             size=num_trials_per_key,
    ...:             p=probabilities,
    ...:         )
    ...:
    ...:
    ...:         if include_py_lookup:
    ...:             for option_index in chosen_option_indices:
    ...:                 # Do something with the choice
    ...:                 _ = possible_options[option_index]
    ...:

In [87]: %timeit rearrange(400, True)
218 ms ± 4.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [88]: %timeit rearrange(400, False)
117 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [89]: %timeit rearrange(10_000, True)
3.4 s ± 66.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [90]: %timeit rearrange(10_000, False)
1.06 s ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's still dominated by python lookup, not the random choice numpy part. Doing 5000 x 10_000 python list index lookups takes longer than doing 5000 x "numpy generate 10_000 random choices".
Python is very, very slow. If you're doing numerical computing, you want to make sure as little implemented-in-python code as possible runs.


e:
the matrix thing Biffmotron is doing is essentially the same idea. Doing it like that is theoretically worse/slower (needs more scratch RAM, worse cache locality), but it moves the bulk of the executing code from python to numpy-implemented-in-C and that time savings more than makes up for doing the calculation suboptimally.

The best-in-abstract way to do it would be to do it like you originally had it with loops that generate, use, and discard state as soon as possible so that the state is most likely to fit in cache, but python-slowness and diffuseness (PyObjects are individually allocated on the heap and aren't necessarily close to each other for caching) outweighs that

Foxfire_ fucked around with this message at 22:50 on Aug 12, 2022
