|
KingNastidon posted:Just getting started with pandas, so this is probably a really naive question. I'm trying to replicate a really basic algorithm from VBA for oncology forecasting. There is a data file with new patient starts and anticipated duration of therapy in monthly cohorts. Each new patient cohort loses a certain number of patients each time period using an exponential function to mimic the shape of a Kaplan-Meier progression curve. The sum of a column represents the total remaining patients on therapy across all cohorts. Below is the code I'm using: Python is extremely slow and extremely RAM hungry. This is a consequence of aspects of its type system and design. The interpreter has no way to know what to do for any operation without going through a dance of type lookups. It also can't prove that those operations are the same from loop to loop, so it has to redo all that work for every element. The trick to get acceptable performance is to arrange it so that none of the actual computation is done in python (so it is fast) and none of the data is saved as python objects (so it is RAM efficient). You essentially have a bunch of python glue code that is telling external libraries what to do in broad strokes, with the actual computation done by not-python. The main library you use to do this is numpy. Pandas uses numpy arrays as the storage for its DataFrame and Series values. A line like: code:
Something like: code:
A piece of code like this: code:
Basically any time you loop over your data, the performance will be crap. Your options are: 1. Rearrange your code so that it's asking numpy to do bulk operations instead of doing them in python. This can be hard/impossible to do and isn't usually how people think about their problems. 2. Write your algorithm that's hard to express in C so that it's a new bulk operation you can do. Alternately, dig through libraries to see if someone else has already done this for the algorithm you want. 3. Use numba. numba is a python-like sublanguage where enough python features are removed and types are locked down that it can be efficiently compiled. Depending on your algorithm, rewriting it in numba could require zero work or could be more annoying than redoing it in C. (The example you posted would be on the zero work end)
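For the cohort problem in the quote, a minimal sketch of what "ask numpy to do bulk operations" looks like. All the sizes, rates, and names here are made up for illustration, not the poster's actual model:

```python
import numpy as np

# Hypothetical data: 120 monthly cohorts of new patient starts, each
# decaying exponentially over 60 months of follow-up.
n_cohorts, n_months = 120, 60
starts = np.full(n_cohorts, 100.0)   # new patients per cohort (made-up)
decay_rate = 0.05                    # per-month attrition (made-up)

# One bulk operation builds the whole cohort-by-month survival grid:
# remaining[i, t] = starts[i] * exp(-decay_rate * t)
months = np.arange(n_months)
remaining = starts[:, None] * np.exp(-decay_rate * months[None, :])

# Total patients on therapy at each month = sum down each column
total_on_therapy = remaining.sum(axis=0)
```

No python-level loop ever touches an individual patient count; the interpreter only runs a handful of statements no matter how many cohorts there are.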
|
# ¿ Mar 31, 2017 07:59 |
|
Malcolm XML posted:This has little to do with python being compiled or not: pandas has a ton of overhead if you don't do things the correct way Huh, I was expecting the pandas overhead to be much smaller. Some testing shows that it goes: vectorized numpy > vectorized pandas >> iterative numpy >>> iterative pandas An iterative solution in something with known types (C/Fortran/Java/numba/Julia/whatever) will still be faster for complicated calculations than the vectorized versions since the vectorized ones destroy all cache locality once the dataset is big enough (by the time you go back to do operation 2 on item 1, it's been booted out of the cache), but you can still get enough speedup to move some algorithms from unworkable to usable. Python's speed problems mostly don't have to do with it being compiled in advance or not. They're consequences of how it works with types and the stuff it lets you change. It's basically impossible to JIT compile stuff like you would in Java because you can't prove that a loop isn't doing things like monkey patching functions or operators at some random loop iteration in the middle. The interpreter has to execute what you put down naively since it can't prove that it's safe to take any shortcuts. Timing results: code:
code:
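A rough harness for reproducing that ordering yourself. The array size and operations are illustrative only; the exact ratio depends on your machine:

```python
import time
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Vectorized: the bulk of the work happens inside numpy's C loops
t0 = time.perf_counter()
vec = x * 2.0 + 1.0
t_vec = time.perf_counter() - t0

# Iterative: every element goes through the interpreter's type-lookup dance
t0 = time.perf_counter()
loop = [xi * 2.0 + 1.0 for xi in x]
t_loop = time.perf_counter() - t0
```

On a typical machine the loop version is something like two orders of magnitude slower for the same arithmetic.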
|
# ¿ Apr 1, 2017 05:59 |
|
Linear Zoetrope posted:Do you need to touch the reference counters or GIL if you want to spawn a bunch of threads from the C side, call Python functions that return PyObjects, and forward the PyObjects between the threads? To be clear, only one thread will touch a given PyObject * at a time. You can pass the pointers themselves amongst your threads however you want. But at most one thread may be invoking Python code at any time and theoretically it should own the GIL. You can't be accessing multiple PyObjects simultaneously, even if any particular object is only accessed by one thread. I think you avoid GIL stuff if you can guarantee: - your threads will synchronize themselves to avoid simultaneously trying to run Python code - no threads are ever created from Python The interpreter would be running in its single-threaded mode where the GIL doesn't actually exist and there's only one possible value of its global ThreadState structure in play. CPython doesn't use any OS thread-local variables, so it shouldn't care what OS thread is actually running its code, as long as there's only one at a time.
|
# ¿ Sep 8, 2017 05:28 |
|
creatine posted:Does anybody know of a python3 module that let's you work with .FCS files? All the ones I can find are 2.7 only https://github.com/eyurtsev/fcsparser claims to do it, I've only ever used FlowCytometryTools (which is 2.7 only) Depending on what exactly you need, it's not too horrible to write something yourself to yank out the data and compensation matrix.
|
# ¿ Sep 28, 2017 06:16 |
|
pudb's my favorite if you're on unix
|
# ¿ Nov 30, 2017 06:33 |
|
salisbury shake posted:I'm just learning to use NumPy and had to poke around the docs to do what I want. While the above code finds the solution correctly, I'm unsure if I'm using NumPy efficiently or even canonically. Outside of a code-golf type fun challenge, there's not really a reason to vectorize this kind of thing. It will make it less readable, slower, and more memory hungry than a loop version. The problem is naturally solvable and readable with a loop, and loops are what computers are good at. The problem with doing a naive loop is that Python is very, very slow. To get your math to run at a reasonable speed, you need to get the computation out of Python. You can do that by using NumPy ufuncs so that the only bit of Python code that runs is invoking the ufunc a few times and having each ufunc [which is implemented in C] do lots of math. But that means you have to figure out a way to vectorize instead of doing it the natural way. An alternative is to use numba. It provides a decorator that you can apply to your Python function that says "instead of executing this function in the Python interpreter, just-in-time compile it into normal instructions the first time it is used and automatically generate the boilerplate to pass data back and forth". It can only work with a subset of the Python language, but it's often good enough. Taking your get_fabric function and massaging it a bit (I don't remember offhand if namedtuples are supported in numba): code:
Bonus thoughts for refining your solution: - With your solution, what happens if you have a million elves on a 10x10 fabric? - Can you avoid having to allocate the length x length array? i.e. can you make it work if the fabric is 10000000 x 10000000 as long as the claim list is short?
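A sketch of what the numba version could look like. The fill_fabric function here is a hypothetical reconstruction of the claims problem, not the poster's actual code, and the try/except fallback just runs plain Python if numba isn't installed:

```python
import numpy as np

try:
    from numba import njit            # JIT-compile when numba is available
except ImportError:                   # otherwise run as ordinary Python
    def njit(fn):
        return fn

@njit
def fill_fabric(fabric, claims):
    # claims: rows of (left, top, width, height); count coverage per cell
    for i in range(claims.shape[0]):
        left, top, w, h = claims[i]
        for x in range(left, left + w):
            for y in range(top, top + h):
                fabric[x, y] += 1
    return fabric

fabric = np.zeros((10, 10), dtype=np.int64)
claims = np.array([[1, 3, 4, 4], [3, 1, 4, 4], [5, 5, 2, 2]])
fill_fabric(fabric, claims)
overlap = int((fabric >= 2).sum())    # cells claimed more than once
```

The natural nested loop stays as-is; the decorator is the whole change.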
|
# ¿ Dec 7, 2018 10:24 |
|
You can look at the dictionaries themselves and see where the entries are:code:
code:
*this is a little bit of a lie. There are some other names that are somewhere else. That's why you don't see an entry for "__class__" in test and test2's dicts even though you can access "test.__class__" without getting an error. And there's a few more places that get searched for base classes.
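A small illustration of where the entries live. The class and attribute names are invented for the example:

```python
class Test:
    class_attr = 1            # lives in the class's dict
    def method(self):         # so does the method
        return self.class_attr

t = Test()
t.instance_attr = 2           # lives in the instance's dict

# vars() exposes the underlying __dict__ mappings
inst_names = set(vars(t))     # only the instance attribute
cls_names = set(vars(Test))   # class attributes and methods

# Attribute access walks instance dict first, then class dicts along the MRO
value = t.class_attr
```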
|
# ¿ Dec 12, 2018 05:43 |
|
Installing opencv is kind of a clusterfuck and I wouldn't really expect pip-ing it to work. From vague memories of the last time I did that, pip-ing it only installs the python wrapper, not opencv itself. cv2.pyd is the dll with the python wrapper. It implicitly links to all the actual opencv dlls. Probably one or more of those is missing or not locatable. Take dependency walker and open cv2.pyd with it. It will show you everything cv2.pyd tries to load, everything its immediate dependencies try to load, etc... Every implicit load has to be satisfied or cv2.pyd will fail to load. Delay-loaded ones don't have to be satisfied (delay-load = the program tries to LoadLibrary() in some code path, which (A) may never actually be executed and (B) it can handle LoadLibrary() returning an error)
|
# ¿ Jan 5, 2019 06:35 |
|
Write it back out to disk as a big binary array, memmap it, and let the OS swap it in and out as needed?
|
# ¿ Feb 5, 2019 07:58 |
|
SirPablo posted:Not sure I follow this. I tried a memmap but seemed to just hose my machine. If you try to read 50GB into memory, you'll get an out-of-memory error since you don't have that much RAM (or if you have 50GB+ of swap enabled, you'll effectively get a crappy memmapped behavior). Memory mapping it is telling the OS "When I try to access something in this region of memory, go read the surrounding little area from disk, pretend it was in memory, and cache it so that if I access it again you don't have to go all the way to disk". If you touch a new area and there's no free RAM to use to cache it, some other disk-backed page will get kicked out of RAM according to the OS's policies and the next program that touches that will have to wait while it reads in from disk (and boots something else). If you're accessing the 50GB in a random unpredictable way (so that every read/write is likely to have to be loaded from disk and boot some other page to make space), there's not really anything you can do to make it run well besides get more RAM. But if the access pattern is sequential or has locality like most do, accesses will tend to be to regions that are already cached.
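A toy np.memmap sketch of that read-on-demand behavior. The file path and sizes are made up and tiny so it runs anywhere; the point is that touching a slice only pages in the slice:

```python
import numpy as np
import os
import tempfile

# Write an array out as a raw binary file
path = os.path.join(tempfile.mkdtemp(), "data.bin")
np.arange(1_000_000, dtype=np.float64).tofile(path)

# Map it instead of reading it; nothing is loaded until it's touched
mm = np.memmap(path, dtype=np.float64, mode="r", shape=(1_000_000,))

# This only faults in the first few pages, not the whole file
chunk_sum = float(mm[:1000].sum())
```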
|
# ¿ Feb 9, 2019 08:09 |
|
Sad Panda posted:Poorly explained. I'll try again. It's been a while since I've had to care, but if I'm remembering right, when you drop your remote desktop session, that desktop stops existing and all the OS resources that were allocated for it get freed. A desktop is a thing in Windows' object model and participates in a bunch of stuff for both security and resources. Example: the administrator password prompt runs on its own desktop. You can't send or receive messages from windows on it from the default interactive desktop, they don't share clipboards, etc... I think video memory will also be deallocated, so probably your immediate problem is that there isn't any buffer to screen grab from anymore. When you log back in, memory gets allocated again and all windows get told to repaint themselves.
|
# ¿ Mar 20, 2019 05:53 |
|
You won't do better than the C windows API way of doing it since that's underneath everything Python does [and Python's underlying thread scheduling is terrible]. That will still give ~20ms jitter if other things want CPU and don't yield sooner, so if that is too much, you'll need to do something else. There you would do it by using CreateWaitableTimer to create a timer object, SetWaitableTimer to configure it as periodic with some beat, and WaitForSingleObject to sleep until the timer expires. You get bonus priority from the scheduler because you're waiting on IO. You could ctypes it from python. I don't think there's a standard way to sleep for anything but a relative duration, so otherwise the best you could do is calculating an approximate wake time and doing a relative sleep (then either spinning until it's time or accepting whatever inaccuracy is left). But if you have a microcontroller, just use that. It is much better at doing things with consistent timing than a big computer.
|
# ¿ Mar 30, 2019 03:45 |
|
TwystNeko posted:Okay, so I've got this ESP32 working rather well with Micropython, controlling a MAX 7219-based LED matrix, and capacitive touch sensing for buttons. However, I need some help with interrupts. I think. Set up two ISRs: - On the GPIO activating, set a flag then return. - On the timer expiring, set a different flag then return (not really needed since you can infer it from the first one) Forever: - Update the display with the picture for this state - Load the timer with the time till the next frame and start it - Sleep the chip till there's an interrupt - Mask interrupts - Disable timer - Copy global flags to locals and clear global ones - Unmask interrupts - Decide new state based on whether it was timer or button that woke you up Depending on hardware, you might need to do something fancier to debounce the GPIO
|
# ¿ Apr 7, 2019 07:34 |
|
You're writing to stdout and it's line buffered by default. It won't flush to the OS unless a buffer fills up or you write a newline. use print(whatever, flush=True)
|
# ¿ Jul 28, 2019 08:57 |
|
Windows also accepts either direction
|
# ¿ Oct 12, 2020 20:57 |
|
That looks like correct behavior? On windows, the paths "..\this\that.txt" and "../this/that.txt" both mean "up one directory, into a directory named this, file name that.txt" On linux, the path "../this/that.txt" means that, but the path "..\this\that.txt" means "file named ..\this\that.txt" in the current directory. ..\this\that.txt is a legal posix filename because posix filenames are terrible (newlines and nonprinting UTF-8 characters are also legal posix filenames) If you want to make that be interpreted as a path with directories on linux, you will need to either explicitly tell it to interpret it as a windows path or manipulate the string into a posix path Foxfire_ fucked around with this message at 20:08 on Oct 13, 2020 |
# ¿ Oct 13, 2020 20:05 |
|
If you're composing the paths, just always use forward slashes since windows will also accept that. If you're accepting the paths from users, you'll need to decide how to interpret what they mean when they enter "this\that" on linux since it technically means filename but they probably didn't mean that if they are good upstanding people
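One way to do that interpretation step explicitly is with pathlib's pure path classes, which parse a string under one platform's rules without touching the filesystem:

```python
from pathlib import PurePosixPath, PureWindowsPath

raw = r"..\this\that.txt"

# Under Windows rules, backslashes are separators...
win = PureWindowsPath(raw)
parts_win = win.parts

# ...under POSIX rules the same string is a single opaque filename
posix = PurePosixPath(raw)
parts_posix = posix.parts

# Explicitly reinterpreting a Windows-style path for use on POSIX
converted = PurePosixPath(win.as_posix())
```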
|
# ¿ Oct 13, 2020 20:21 |
|
That sounds large enough & you caring about performance enough that using python is a bad idea. Python is very, very slow. If you're doing large numerical stuff, you can really only use it as a glue language to connect together pieces implemented in something else. Backing it with rust or C, or writing in a python-adjacent language like Cython or numba seem like better ideas to me. How many times are you going to use this thing and how many points are you going to classify? If it's not that many, a naive iteration through all points for every lookup in not-python might be the fastest thing considering your time to implement it. 20,000 points is not that many. Like making up some numbers and calling each point 100 doubles long, that's only 16MB of memory to go through per classification.
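For scale, a brute-force sketch of "naive iteration through all points for every lookup", with made-up data at roughly the sizes mentioned (20,000 points, 100 doubles each). The per-lookup pass is one bulk numpy operation, so python only runs a couple of statements per query:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(20_000, 100))    # reference set (made-up)
labels = rng.integers(0, 3, size=20_000)   # made-up class labels

def classify(query):
    # Squared distance to every reference point in one vectorized pass
    d2 = ((points - query) ** 2).sum(axis=1)
    return labels[np.argmin(d2)]

pred = classify(points[123])   # a point's nearest neighbor is itself
```

That inner pass touches the whole ~16MB once per query, which is milliseconds on anything modern.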
|
# ¿ Oct 15, 2020 07:26 |
|
The original reason for Python not having static types is that it makes writing the interpreter easier. In the interpreter, every single python object is the same type (PyObject) and supports exactly the same set of operations & data. Python grew out of a one man hobby project and lots of its warts stem from that. Its predecessor language (ABC) did have a static type checker (and so did Algol, FORTRAN and C, all of which are much older) Dynamic types also make objects changeable at runtime, which is useful for modifying stuff for tests or messing with other people's code. Every function/attribute access is internally "Look up a value in a map of string -> PyObject, throw AttributeError if that key doesn't exist", so you can trivially rewire those mappings at runtime. Also people like having less stuff to type when working interactively or doing little shell script type things and those were python's main target area for most of its history. Guido van Rossum was resistant to adding type hints up until he moved from academia to having to maintain large codebases.
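A tiny demonstration of that rewiring, the kind of thing test stubs rely on (class and method names are invented):

```python
class Greeter:
    def greet(self):
        return "hello"

g = Greeter()
before = g.greet()

# Method lookup is just a dict lookup on the class, so it can be
# reassigned at runtime, e.g. to stub behavior out in a test:
Greeter.greet = lambda self: "stubbed"
after = g.greet()
```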
|
# ¿ Oct 17, 2020 19:55 |
|
salisbury shake posted:The above is not only slow, it maxes out my CPU at only something like 150MB/s. I would not be surprised if you are just running into python being slow as a limit. You are doing: - Look up what 'next' means on the stdin.buffer object - Call it, it reads data into an internal buffer - Allocate a new python object and copy the bytes into it (I think you are at least skipping the bytes -> UTF-8 conversions the way you are doing it) - Look up what 'write' means - Call it, copy the data into another internal buffer (occasionally flush that buffer to the OS) - Deallocate the python object - Loop The shell pipe redirections are just renaming things so that the output of the first and the input of the second are the same thing. You can't introduce yourself in the middle of that because there is no middle, they're one file with two names. If you want to insert your program into a pipeline and don't care about anything besides counting, the fastest way will be to use os.open() / os.read() / os.write() and a decent sized buffer size to skip as much python as possible. You'll still be allocating & deallocating python objects, copying data, and doing dictionary lookups for every chunk though, so it'll be slower than a C one Mirconium posted:For multiprocessing pools, should I be spawning one any time I want to async-map over an iterator, or can I just create a single pool for the arena that I am working with and then summon it by reference any time I need to go over an iterator? I'm not planning to actually asynchronously run anything, I just want to parallelize some iterative computations. Be aware that by default, python on unix multiprocessing does invalid things. It will fork() to create copies of the existing process, then try to use them as pool workers, which violates POSIX. It will usually happen to work for most common libc implementations as long as absolutely everything in the process is single threaded.
multiprocessing.set_start_method('spawn') will fix it.
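A sketch of the "decent sized buffer + os.read()" idea, counting newlines. The demo feeds a pipe instead of real stdin so it's self-contained; in a pipeline you'd pass sys.stdin.fileno() instead:

```python
import os

def count_newlines(fd, bufsize=1 << 20):
    # Read big chunks straight from the OS; bytes.count() runs in C,
    # so per-byte work never touches the interpreter.
    total = 0
    while True:
        chunk = os.read(fd, bufsize)
        if not chunk:          # empty read = EOF
            return total
        total += chunk.count(b"\n")

# Demo on a pipe standing in for stdin
r, w = os.pipe()
os.write(w, b"one\ntwo\nthree\n")
os.close(w)
lines = count_newlines(r)
os.close(r)
```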
|
# ¿ Nov 11, 2020 04:19 |
|
Tayter Swift posted:I have a 40 GB CSV, about 50 million records by 90-ish fields. I need to sort it by fields VIN and Date, and remove duplicated VINs by most recent Date. My machine has 64GB. Turn it into a numpy record array, then do the sorting via numpy. If the de-duplication is not trivial to do via numpy interface, write it in numba so you can do a natural iterative thing instead of trying to hammer it into array operations with many temporary copies. If still too big for memory, np.memmap() is easy and won't be that much slower. e: Type hint looks correct to me, except that it should be Optional[Union[int, range, Iterable[Union[int, range]]]] if you're allowing None like the default suggests. Not surprising that tooling doesn't understand complicated things I read it as foo is one of: - None - int - range - Iterable where each entry is either int or range Foxfire_ fucked around with this message at 23:10 on Feb 22, 2021 |
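A toy sketch of the sort-then-deduplicate step on a structured array. The field names and data are made up, not the real 40GB file, but the same two operations scale to it (or to an np.memmap of it):

```python
import numpy as np

# Stand-in for the CSV: (vin, date) pairs with duplicate VINs
arr = np.array(
    [("B", 3), ("A", 1), ("B", 7), ("A", 5), ("C", 2)],
    dtype=[("vin", "U8"), ("date", "i8")],
)

# Sort by VIN then date, entirely inside numpy
arr.sort(order=["vin", "date"])

# Within each VIN's sorted run, keep only the last (most recent) row:
# a row survives if the next row has a different VIN, plus the final row
keep = np.r_[arr["vin"][1:] != arr["vin"][:-1], True]
deduped = arr[keep]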
# ¿ Feb 22, 2021 23:03 |
|
OnceIWasAnOstrich posted:edit: I'm having a hard time figuring out whether the PEP/mypy actually allow for covariant type lists. I think maybe the type system doesn't allow that for mutable types? I think maybe PyCharm sees list, the type system specifies lists can only be invariant, so it sees a List[int], converts to Iterable[int] and calls it a day? I don't think it's thought through enough to actually have specified behavior outside of whatever some particular tool assumes. Is Optional[int] a type or a statement about two different types? No actual object will ever have a type of Optional[int]. Some names might variously refer to None or an int, should the name get a 'type' (distinct from the object type) that somehow logically gets attached to it? From CPython's point of view, none of the type hints have anything to do with actual types, so there's nothing there to guide anything. PEP484 has code:
e: That's maybe a bad example since you could read it as requiring every tuple in the iterable to have the same number type. List[Any] does definitely show up in the PEP though Foxfire_ fucked around with this message at 00:31 on Feb 23, 2021 |
# ¿ Feb 23, 2021 00:18 |
|
CarForumPoster posted:Yea a db is the way to go though I also need to do fuzzy string matching which is a whole other issue but I suppose if I wanna do that I can query all 8M rows of 1 or 2 cols into memory the do my fuzzy matching in python slow as loving can be. Maybe make a AWS Lambda function to parallelize a bunch of these but have the Your underlying problem is more that pandas in general is trying to optimize for person-time writing code at the expense of being very slow and using lots of RAM. Generally that's a good assumption, but not here if you have too much stuff and are going to run it a lot. pandas is also horrible at strings since the underlying thing actually being stored is arrays of PyObject pointers to full python objects elsewhere. 8,000,000 rows x 50 cols isn't that much data. Like if each string is 64 bytes, that's still only about 25GB, which is only going to take a second or two to go through if they're already in RAM. I would: - Move data to plain numpy arrays of string_ dtype (fixed length, not python objects) - Do the fuzzy string match in numba. Levenshtein distance implementations are easily googleable Ought to take less than a second to compute distance between the query and every element in a column Wouldn't use a database since you won't be able to do the fuzzy matching on the database side, and you want to avoid having to construct python objects or run python code per element.
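A plain-Python Levenshtein sketch of the kind of function you'd hand to numba's @njit. This version runs as-is without numba; treat it as illustrative rather than tuned:

```python
import numpy as np

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    # numba's @njit could be applied on top of essentially this loop.
    prev = np.arange(len(b) + 1)
    for i, ca in enumerate(a, start=1):
        cur = np.empty(len(b) + 1, dtype=np.int64)
        cur[0] = i
        for j, cb in enumerate(b, start=1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (ca != cb)) # substitution
        prev = cur
    return int(prev[-1])
```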
|
# ¿ Mar 14, 2021 03:17 |
|
Protocol7 posted:You know, I do a lot of stuff with Pandas, but after reading some of the recent posts in this thread... It is fine and good if the problem you are trying to solve matches up with what it is designed to do. It is a good tool for interactively viewing/manipulating small amounts of data. It is not a good tool for manipulating multi GB datasets or doing things you're going to run a million times and you care about performance. If you're only going to run something once and it still only takes a few seconds, doing it 100X slower is a good trade for 5 mins of your time. Protocol7 posted:For example, did you know if you are copying items from one spreadsheet to another, that you need to specify pandas.set_option('display.max_colwidth', SOME_LARGE_VALUE) or else it will just silently truncate strings longer than 100 characters by default? Because I sure didn't.
|
# ¿ Mar 15, 2021 20:11 |
|
Jose Cuervo posted:I am on board with doing this, but I am having trouble wrapping my head around how I would accomplish this. Do I define a new column, say pid as the primary_key for the Patient model as below, and use pid as the foreign_key in the other models as below for the Diabetic model? You have a meaningless-outside-your-application patient ID that uniquely identifies a patient. All of your deidentified tables reference that for knowing what patient a row is about. Then you have a table that maps from the patient ID to all the PHI about that patient (MRN, names, DOB, ...). That table could be either inside the same database or elsewhere if you wanted to segregate the dataset into PHI and deidentified parts. You might also need multiple PHI tables for some stuff, like if you want to store multiple different-system/different-site MRNs for one actual person. For each table in the schema, you should be able to classify it as PHI or not and shouldn't be mixing PHI and not-PHI together in the same table. If you've done stuff right, you ought to be able to hand someone all your non-PHI tables and they'd be able to follow information about a particular subject, but not have any PHI about them
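A minimal sqlite3 sketch of that split, with invented table and column names (the real schema would go through sqlalchemy models, but the table shape is the point):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- All the PHI lives in one table keyed by an opaque pid
    CREATE TABLE patient_phi (
        pid  TEXT PRIMARY KEY,   -- meaningless outside the application
        mrn  TEXT NOT NULL,      -- MRNs as strings: '012345' != '12345'
        name TEXT NOT NULL,
        dob  TEXT NOT NULL
    );
    -- Deidentified clinical tables only ever reference the pid
    CREATE TABLE diabetic (
        pid       TEXT NOT NULL REFERENCES patient_phi(pid),
        a1c       REAL,
        visit_day INTEGER        -- relative day, not an actual date
    );
""")
con.execute("INSERT INTO patient_phi VALUES ('p1', '012345', 'Jane Doe', '1970-01-01')")
con.execute("INSERT INTO diabetic VALUES ('p1', 7.2, 14)")

# The deidentified side can be handed over on its own
rows = con.execute("SELECT pid, a1c FROM diabetic").fetchall()
```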
|
# ¿ Mar 23, 2021 21:25 |
|
What you're trying to do seems generally right to me. Would toss in a unique constraint on pid if you want to use (MRN, HealthSystem) as the primary key. Also, MRNs probably should be strings, not integers. "012345" and "12345" are probably distinct MRNs. For the autoincrement, like Wallet said, SQLite has weird autoincrement behavior and specific workarounds in sqlalchemy. Since you don't particularly care that the values are ascending integers as long as they're unique, a reasonable portable thing would be to externally make a UUID and use that. Or you could beat sqlite into giving you a unique integer, but you'll have something somewhat db engine dependent. e: also be aware that foreign key constraints in SQLite by default do nothing. Foxfire_ fucked around with this message at 05:07 on Mar 25, 2021 |
# ¿ Mar 25, 2021 05:04 |
|
OnceIWasAnOstrich posted:You can create a multiprocessing.Process with the target= argument as the callable you want to run, ideally using a Pipe or Queue to feed your data back to the main process. You then call .start() and then .join(timeout) with whatever timeout you want before checking the Pipe/Queue and calling it a day. If your underlying callable releases GIL you can even do this with threading.Thread instead. Its default behavior is to fork() to make a copy of the process, then keep using it without calling exec(). This has always been forbidden by POSIX, but is commonly done. It will generally work fine as long as absolutely everything in the process holds no locks at the moment the fork occurs, which you have no way of assuring in general (you can't easily prove that some library doesn't make a thread that it uses internally)
|
# ¿ Jun 10, 2021 04:18 |
|
That doesn't actually kill the process, it just stops waiting for the result:Python code:
code:
mr_package posted:I was also surprised that you basically cannot easily kill threads/processes that you launch. There are third-party libraries and what appear-- to me-- to be very hack-y approaches to dealing with this to be found on SO. Perhaps they are fine, perhaps this is just the way it is done. This is my first look at concurrency, do other languages handle this pretty much the same way or is it Python-specific? I'm thinking for example if you were writing something in go or rust or C# or whatever and you wanted to terminate a process early, is it easier? It is possible to safely abruptly kill a process if you have appropriate permissions, but it's not something I would expect a processing pool library to make easy. Again, your code should have a way to gracefully cancel (returning the worker to the pool), not forcefully rip down a piece of the pool infrastructure and hope the pool implementation heals.
|
# ¿ Jun 11, 2021 03:43 |
|
mr_package posted:What does the correct way to do this look like? I have a piece of code I know might take too long. How do I set up the call to it so that it terminates in an appropriate time? This is not a standard thing? For lots of stuff that wasn't ever intended to be used asynchronously, there won't be a way to do that. gethostbyname() doesn't have a way to cancel it. What you can do is make the code that launched it not wait anymore & make it so that once the slow call does return, your code ignores the result and just exits: Python code:
mr_package posted:As an aside, does anyone know offhand what happens when you call subprocess.run() with a timeout? Does the process continue to run in the background and then exit normally however long it takes? I'm wondering if the caller just ignores it and moves on or if it actually communicates to the process that it should quit/cancel (presumably the process would be responsible for then handling this in a mostly graceful way).
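A sketch of the "stop waiting and ignore the result" pattern using concurrent.futures. The slow_call here is a made-up stand-in for something uncancellable like gethostbyname():

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def slow_call():
    time.sleep(0.5)              # stand-in for an uncancellable call
    return "late answer"

pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(slow_call)
try:
    # Give up after the deadline; nothing gets killed
    result = future.result(timeout=0.05)
except FutureTimeout:
    result = None                # worker keeps running; we just move on
pool.shutdown(wait=False)
```

The thread finishes on its own schedule and its return value is dropped, which is the graceful version of "terminate it".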
|
# ¿ Jun 12, 2021 01:38 |
|
The first lesson of using Python on biggish datasets is "Don't". It's slow and python objects have huge amounts of overhead. Only use python as a glue language to manipulate non-PyObject things with not-python implemented code (this is an old i5-2500K and the dataset is only ~4GB) e: If you want a python list of python integers at the end, it's O(x) to go back to that. It'll get enormous though because each 4 byte integer is going to turn into a full PyObject. Doing the 'ndarray of a billion integers' -> 'python list of a billion python integers' conversion is slower than generating and sorting the ndarray. Foxfire_ fucked around with this message at 05:39 on Sep 10, 2021 |
# ¿ Sep 10, 2021 05:26 |
|
QuarkJets posted:That code is pretty hard to read Solid state disk is much slower than main RAM Main RAM is much slower than L3 cache L3 cache is slower than L2 cache L2 is slower than L1 L1 is slower than CPU registers Copying a file is essentially no CPU work if the underlying copy is implemented sanely. The CPU is telling the disk controller "Copy this sector of data to main memory, interrupt me when you're done", taking a nap, telling the disk controller to copy in the other direction and napping again, then repeating that till everything's copied. Having more or faster CPUs won't help because they're napping 99% of the time anyway It's like you've got one guy with a shovel and ten people standing around telling him where to dig next. It's not faster than one guy with the shovel and one boss. e: also, unrelatedly, you have to do import multiprocessing; multiprocessing.set_start_method('spawn') on Unix to get not-broken behavior. The default ('fork') violates fork()'s specification and may randomly deadlock depending on what other code in your process is doing. Foxfire_ fucked around with this message at 04:36 on Oct 5, 2021 |
# ¿ Oct 5, 2021 04:26 |
|
There is no expectation that anything pickled can be unpickled when the version of anything changes. It might work, might give you an error, and might silently give you corrupted data. Pickles are not suitable for a nontransient serialized format
|
# ¿ Apr 1, 2022 19:30 |
|
The pickle protocol may be backwards compatible (that's broken a couple times in Python history, but at least those were considered bugs), but your likelihood of getting usable data back from a pickle that's sat somewhere while a couple years of people and documentation turnover went by isn't that great unless you're doing something like building a manual "convert to/from a dict of language primitives" pre-serialization step. And at that point you might as well be using either a self-describing serialization format, or a binary format that's less 'do everything' but better at the things your application cares about specifically. (mostly I'm bitter about scikit having no serialization format besides pickles and having to deal with people's broken files from some unknown past version)
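A sketch of that manual dict-of-primitives step; the model class and its fields are invented for the example:

```python
import json

class TinyModel:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def to_dict(self):
        # Only language primitives cross the serialization boundary
        return {"version": 1, "weights": list(self.weights), "bias": self.bias}

    @classmethod
    def from_dict(cls, d):
        # An explicit place to notice and migrate old formats
        assert d["version"] == 1
        return cls(d["weights"], d["bias"])

m = TinyModel([0.5, -1.25], 3.0)
blob = json.dumps(m.to_dict())   # self-describing, survives class refactors
m2 = TinyModel.from_dict(json.loads(blob))
```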
|
# ¿ Apr 2, 2022 21:34 |
|
D34THROW posted:I was today years old when I discovered float.as_integer_ratio.
|
# ¿ Apr 4, 2022 19:41 |
|
D34THROW posted:Fractional inches. For the cut list, a straightforward debugging method is to print out stuff and see where what actually happens splits from what you think happens: code:
The else path on that doesn't make sense to me glancing through, the piece doesn't fit into the freespace, but it's messing with the solution and remaining freespace
|
# ¿ Apr 4, 2022 21:17 |
|
I'm not 100% sure I'm following what you're doing, but I think you have each timespan starting out on its own and you keep merging them until the gaps are bigger than 10mins? Disjoint set forest is a data structure that can do that efficiently. For N things, they are grouped into sets with each object belonging to exactly one set. Finding which set something belongs to and merging sets is amortized ~O(1) (But depending on your problem size & how often you will do it, it may not be worthwhile to implement & test it vs just waiting out a trivial python version or doing a trivial implementation in some faster non-python thing (either python-adjacent numba/numpy, or a completely separate language. Or find someone who's already implemented the data structure in python))
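A minimal disjoint-set-forest sketch applied to the gap-merging idea, i.e. the trivial python version. The timespans are made up:

```python
class DisjointSet:
    # Union-find with path halving and union by size
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

# Merge adjacent timespans whose gap is under 10 minutes (made-up data)
starts = [0, 5, 8, 30, 33]           # minutes
ds = DisjointSet(len(starts))
for i in range(len(starts) - 1):
    if starts[i + 1] - starts[i] < 10:
        ds.union(i, i + 1)
groups = len({ds.find(i) for i in range(len(starts))})
```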
|
# ¿ Jul 28, 2022 19:16 |
|
Do you intend to count a line like "yellow yellow" as one yellow count? Also, do you intend it to be case-sensitive?
|
# ¿ Aug 5, 2022 16:09 |
|
Is it the choices call that's slow or the rest of it? Non-python code (numpy random, and numpy generally) will run much faster. The typical way to do numerical computing with python is to use python as a glue language connecting non-python code (numba or numpy)
|
# ¿ Aug 11, 2022 18:31 |
|
Can you set up some smaller dummy thing that duplicates the problem? When I run this: Python code:
|
# ¿ Aug 11, 2022 21:29 |
|
A less dramatic rearranging. This is trading memory for time by getting rid of python objects and python computation. 1x python saying "Numpy, please generate me 400 choices using these probabilities" is much faster than 400x python saying "Numpy/random.choices, please generate me 1 choice using these probabilities" Python code:
Python is very, very slow. If you're doing numerical computing, you want to make sure as little implemented-in-python code as possible runs. e: the matrix thing Biffmotron is doing is essentially the same idea. Doing it like that is theoretically worse/slower (needs more scratch RAM, worse cache locality), but it moves the bulk of the executing code from python to numpy-implemented-in-C and that time savings more than makes up for doing the calculation suboptimally. The best-in-abstract way to do it would be to do it like you originally had it with loops that generate, use, and discard state as soon as possible so that the state is most likely to fit in cache, but python-slowness and diffuseness (PyObjects are individually allocated on the heap and aren't necessarily close to each other for caching) outweighs that Foxfire_ fucked around with this message at 22:50 on Aug 12, 2022 |
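The 1-call-for-400-draws idea, spelled out as a standalone sketch (the states and probabilities are invented):

```python
import numpy as np

rng = np.random.default_rng(42)
states = np.array([0, 1, 2])
probs = np.array([0.7, 0.2, 0.1])

# One call generating all 400 draws, instead of 400 one-draw calls:
# python runs a single statement, numpy's C code does the rest
draws = rng.choice(states, size=400, p=probs)
counts = np.bincount(draws, minlength=3)
```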
# ¿ Aug 12, 2022 20:41 |