|
KICK BAMA KICK posted:Just curious, is there a long story short on why tables are hard? With OCR'd data all you have are the coordinates of the text chunks, the text itself, and possibly some other visual cues like cell borders that you can extract from the image. All of these will be confounded by standard OCR issues--you can never trust that the text will be correct, and if the image isn't clean you can expect that there'll be random visual noise misinterpreted as characters all over the place. Unless you're able to split along cell borders you find in the image, you have to determine the cell size based only on the coordinates (and possibly contents) of the text. Often you won't have all or any cell borders--e.g., punished milkman's images have horizontal borders but not vertical borders. Consider that cells can have different widths and different heights, that text can potentially run immediately up to any edge of the cell or such that one cell's text ends exactly one space or line break's distance from where the next cell's text starts. Consider also that some cells may span multiple columns, and some cells may span multiple rows, and some may do both. If you're dealing with a lot of well-formatted, standardized tables or something like standardized forms, a common approach is to create templates that basically define where you can expect to find each piece of information. With enough training data, you can build ML models that can support more loosely-structured forms with a similar approach. If you're dealing with less but still somewhat regularly formatted data, you can build out more specific parsers based on your own domain knowledge. If you're dealing with arbitrary data, well, good luck.
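The coordinate-only approach described above can be sketched in a few lines. This is a toy illustration, not a real implementation: the `Word` structure and the tolerance value are assumptions, and real OCR output would be far noisier.

```python
# Hypothetical sketch: group OCR'd words into table rows using only their
# bounding-box coordinates (no cell borders). Word and y_tol are invented.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # left edge of bounding box
    y: float  # top edge of bounding box

def cluster_rows(words, y_tol=5.0):
    """Group words whose y-coordinates are within y_tol into the same row."""
    rows = []
    for w in sorted(words, key=lambda w: w.y):
        if rows and abs(w.y - rows[-1][0].y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # order each row left-to-right
    return [sorted(row, key=lambda w: w.x) for row in rows]

words = [
    Word("Protein", 0, 10), Word("29", 120, 11),
    Word("Sodium", 0, 30), Word("5", 120, 31), Word("mg", 140, 31),
]
for row in cluster_rows(words):
    print([w.text for w in row])
```

Column detection would be a second pass over the x-coordinates, and that's where merged cells and ragged spacing start hurting, as the post explains.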
|
# ? May 11, 2019 05:52 |
|
|
# ? May 27, 2024 01:10 |
|
dougdrums posted:Hahaha, ok. It certainly does matter. Everyone is accepting of your use of a dict as a switch statement, it's fine; I was just explaining why some people don't hold themselves to the standard you've created. Don't try to flip the script; you're the one who said that no one should ever use elif, no one cares if you choose not to use them. And there must be some miscommunication here because really, no, if it costs no additional effort then it's fine to want to write code that runs quickly the first time QuarkJets fucked around with this message at 06:52 on May 11, 2019 |
# ? May 11, 2019 06:47 |
|
mr_package posted:How are you organizing your @dataclass objects? If each instance is essentially a row from a database, is there a good way to order/organize them? e.g. if I just shove everything into a List I need to iterate through to try and find the specific instance I'm looking for. There's probably a smarter way? If you think about it though, the primary key in a database row is part of the organisational system, so it makes sense to have it as the key in the dictionary. It's what you're using to identify and reference a particular data object And sometimes that primary key is also part of the data object, like an ISBN, instead of just an internal identifier the database is using. So in that case, it makes sense to also have it in the object you're storing in the dictionary. Chances are the thing that ends up accessing it will want to make use of that attribute, but won't necessarily have access to the dictionary. And maybe you might want to change that implementation anyway - but the data should stay the same, right? Store what you need to store So yeah you're storing the same piece of data twice, but imo that's ok, it makes sense sometimes! You'll probably want to write something that enforces that consistency - maybe a class that holds the dictionary internally, and adds stuff by automatically reading the key from your data and using that to add an item, so you never interact with the dict directly. It depends if end users care about the ID or not - I'm guessing they supply it for lookups? (If you really care about this redundancy you could have a fetch method that adds the key to the returned data, so you store it separately but the user gets the whole thing) Also don't forget if the IDs are in sequential order you can just index into a list instead, if that works for you
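The "class that holds the dictionary internally" idea might look something like this sketch. `Book` and the `isbn` attribute are made-up examples; the point is that the key is always read off the object, so the dict key can never drift out of sync with the stored data.

```python
# Sketch of a store that derives each dict key from the object itself.
from dataclasses import dataclass

@dataclass
class Book:
    isbn: str
    title: str

class KeyedStore:
    def __init__(self, key_attr):
        self._key_attr = key_attr
        self._items = {}

    def add(self, obj):
        # the key comes from the object, never supplied separately
        self._items[getattr(obj, self._key_attr)] = obj

    def get(self, key):
        return self._items[key]

store = KeyedStore("isbn")
store.add(Book("978-0134853987", "Effective Python"))
print(store.get("978-0134853987").title)  # Effective Python
```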
|
# ? May 11, 2019 07:03 |
|
Original post deleted. Edit: For that OCR thing, someone posted a link to Google’s web API where they have a way you can test it out with just uploading a picture. https://cloud.google.com/vision/ I tried it out with that nutrition picture and here’s one thing it considers as one “text block” quote:7 % S u g a r s / S u c r e s 1 4 g P r o t e i n / P r o t é i n e s 2 9 C h o l e s t e r o l / C h o l e s t é r o l 0 m g S o d i u m 5 m g P o t a s s i u m 2 0 0 m g C a l c i u m 2 0 m g I r o n / F e r 0 . 5 m g V i t a m i n A , V i t a m i n e A 1 0 m o g 1 V i t a m i n C V i t a m i n e C 1 4 m g 1 6 % T h i a m i n e 0 . 0 5 m g And it looks like the first 7% is offset and belongs to the line above it. Anyway my take on this is that if this is the best that Google can do then there probably isn’t a better solution? You’d have to invest some time sanitizing and reformatting the text it seems. Boris Galerkin fucked around with this message at 08:41 on May 11, 2019 |
# ? May 11, 2019 08:31 |
|
punished milkman posted:I think splitting the images up into sections with OpenCV and then extracting/parsing the text is what I'll need to do. This is way more involved than I thought... loving tables I put that example through it and got this as output: code:
It'd probably be easier to write a parser sketched out like ((text label) (decimal number) ([choice of unit]) newline)+ ((digits+) % newline)) per row, with a few bespoke edits to accept "m9" as "mg" and deal with the inconsistent whitespace. You'd have to do some sanity checking to catch the parts where you could possibly have 0.05mg of Thiamine being 49% of the DV. But you could also just totally sidestep that issue by calculating the unit measurements to percent DV from some source, and outright ignore the given DV values. Azure also gives bounding boxes, but for my recipe project I just used them to ensure that the whole recipe was being scanned. dougdrums fucked around with this message at 11:45 on May 11, 2019 |
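A rough sketch of that per-row grammar as a regex, with the "m9" fixup. The sample lines are invented; real OCR output would need more fixups than this.

```python
# label, decimal number, optional unit -- plus a bespoke edit turning the
# common OCR confusion "m9" back into "mg".
import re

LINE = re.compile(
    r"^(?P<label>[A-Za-z /]+?)\s+(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mg|g|m9)?\s*$"
)

def parse_line(line):
    m = LINE.match(line.strip())
    if not m:
        return None
    unit = m.group("unit")
    if unit == "m9":  # OCR misread of "mg"
        unit = "mg"
    return m.group("label").strip(), float(m.group("value")), unit

print(parse_line("Sodium 5 mg"))        # ('Sodium', 5.0, 'mg')
print(parse_line("Iron / Fer 0.5 m9"))  # ('Iron / Fer', 0.5, 'mg')
```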
# ? May 11, 2019 10:57 |
|
Whoops ...
|
# ? May 11, 2019 10:57 |
|
Consider using enums in place of booleans. This article is targeted at Rust, but is relevant to Python as well. The article explains it well, but enums can be thought of as a generalization of booleans that 1: can have more than 2 values (useful even when your situation only includes 2 values, in case it expands later); 2: add a clear semantic meaning. Caveats: Python's enums may not be performance-equivalent to booleans, Python doesn't include exhaustive pattern-matching (see above discussion about switch statements/if-else etc), and the syntax for creation is slightly more verbose. Despite these, I think it's worth considering an enum whenever your instinct points to a boolean. Dominoes fucked around with this message at 15:42 on May 11, 2019 |
# ? May 11, 2019 14:20 |
|
edit: delete
|
# ? May 11, 2019 15:43 |
|
I’m skeptical, although the match behavior in Rust does add an extra compelling reason. To me a) that doesn’t necessarily represent a typical truthiness state (there are plenty of times you’ll never conceive of adding a 3rd state because it either is or isn’t), and b) you’d probably want to be passing in some ‘hasTimedOut’ bool variable instead of a boolean literal, right?
|
# ? May 11, 2019 16:07 |
|
baka kaba posted:So yeah you're storing the same piece of data twice, but imo that's ok, it makes sense sometimes! baka kaba posted:It depends if end users care about the ID or not - I'm guessing they supply it for lookups?
|
# ? May 11, 2019 18:55 |
|
Well by "end user" I really meant whatever's calling it (should have said client). So say your web app goes "hey I need that list of languages" and your Python code goes "here ya go, each one has an ID". If the web app is doing something like displaying all the languages, letting the user pick one, and then passing the relevant ID back to your Python code, that ID is sort of temporary - it's like getting a menu of options and going "yeah, the third one", once you get the thing you wanted you don't really need to remember what its position was anymore, right? Whereas if you're gonna need to reference that ID again later somewhere, it makes sense to bundle it up as an attribute of the Language, since it's more of a persistent reference and the way you actually specify which object you want - it's acting more like a database, the primary key is part of the data (even if the user doesn't ever see it). It depends on exactly what you're doing, but by the sound of it the ID is probably gonna be a persistent reference, right? If the duplication bothers you, imagine you have Languages with an ID field, and you're storing them in a list. That's basically a database table. You can search it just fine, looking at the ID field, but for speed in a database you might want to build an index, which is basically another table that holds the key (so you now have a copy in two places) and a pointer to the data in the first table. That combo of index and data table is sort of what a dictionary is, so it's not that weird to have the key present in the lookup and the data itself! Like I said you could separate the key out for storage, and recombine it with the data when it's retrieved, but that's extra complexity that needs to be worth it baka kaba fucked around with this message at 19:24 on May 11, 2019 |
# ? May 11, 2019 19:21 |
|
baka kaba posted:If the duplication bothers you, imagine you have Languages with an ID field, and you're storing them in a list. That's basically a database table. You can search it just fine, looking at the ID field, but for speed in a database you might want to build an index, which is basically another table that holds the key (so you now have a copy in two places) and a pointer to the data in the first table. That combo of index and data table is sort of what a dictionary is, so it's not that weird to have the key present in the lookup and the data itself! I do something a bit like this in one of my work projects where I need a system of metadata that maps source data (from different sources) to target tables (there can be an arbitrarily large number of target tables per source). For simple lookups, I use the source as a dictionary key, and the actual metadata sits in NamedTuples, which, among other things, has an entry "source_table", which is the same as the dictionary key. Since I also use this metadata for orchestration (think for entry in metadata.keys()), this allows direct lookups (metadata[source_table]) as well as loops, and I still have all data available later for additional steps (I need the source table for logging, for example). This also allows me to write more functional code, since I don't need the whole dictionary or its key(s) while looping, only its value(s), since they contain everything I need.
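An illustrative version of that metadata pattern: the source table name is both the dict key and a field on the NamedTuple, so orchestration loops can work on values alone. All the names here are invented, not from the poster's actual project.

```python
# Dict of NamedTuples where the key is duplicated inside each value,
# supporting both direct lookup and key-free loops.
from typing import NamedTuple

class TableMeta(NamedTuple):
    source_table: str
    target_tables: tuple

metadata = {
    "sales_raw": TableMeta("sales_raw", ("sales_fact", "sales_audit")),
    "hr_raw": TableMeta("hr_raw", ("employees",)),
}

# direct lookup
print(metadata["sales_raw"].target_tables)

# orchestration loop: only the values are needed, yet each one still
# knows its own source table (e.g. for logging)
for meta in metadata.values():
    print(f"loading {meta.source_table} -> {meta.target_tables}")
```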
|
# ? May 11, 2019 21:10 |
|
In the end, because JSON doesn't allow numeric keys I did end up writing the database export in a list-of-dicts format. However I also kept the primary key and wrote a simple function that can read it back in and convert to a dictionary-of-dictionaries where the id is the key, e.g. {1: {"id": 1, "code": "en-US"}, 2: {"id": 2, "code": "fr-FR"}} If I find I'm not actually using it I may just leave it as a list after all; will depend how the rest of the code shakes out and whether it needs/uses it. I haven't worked in this pattern before so I'm not sure if I'm actually gonna need the ids outside of the dict itself.
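The conversion helper described there can be a one-liner. This is a sketch of the stated format, not the poster's actual code:

```python
# Turn a list of row dicts into a dict keyed by each row's "id",
# keeping the id inside the row as well.
def rows_to_index(rows, key="id"):
    return {row[key]: row for row in rows}

rows = [{"id": 1, "code": "en-US"}, {"id": 2, "code": "fr-FR"}]
index = rows_to_index(rows)
print(index[2]["code"])  # fr-FR
```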
|
# ? May 13, 2019 23:48 |
|
What’s the best way to avoid headaches with write access to the program files folders when installing packages on windows while also not doing anything user specific? I think i can make a site-packages folder at an arbitrary location and add it to pythonpath for running code, but can I also get pip to use it transparently?
|
# ? May 14, 2019 15:01 |
|
Does PyCharm have an option to show names of arguments like IntelliJ does? In this picture, 'id' and 'name' are added automatically.
|
# ? May 15, 2019 19:53 |
|
I know I'm spreading myself around threads a bit, but this is a Python problem I think. I built TensorFlow 1.13 in a docker container. I'm pretty sure it was using the correct CPU type for the build. It built, and was packaged. I tried installing it in a venv. code:
What does it want from me?
|
# ? May 15, 2019 23:14 |
|
Can you just use the official docker container as your base? https://www.tensorflow.org/install/docker
|
# ? May 16, 2019 01:13 |
|
necrotic posted:Can you just use the official docker container as your base? https://www.tensorflow.org/install/docker A little over halfway down. https://www.tensorflow.org/install/source I think I did. Right now I seem to be building a CPU-only TensorFlow using a Python 3 virtualenv I put together today. I just can't do the CUDA one. There's a breakage with no sufficient answers online. Next, many hours of waiting. While I'm here, does ImageAI actually work properly? Or does it just hate me?
|
# ? May 16, 2019 04:01 |
|
Err is there a reason why you're building TF manually? There are official TF images on dockerhub for almost every combination of Python, CPU/GPU you'd be interested in.
|
# ? May 16, 2019 04:32 |
|
Sad Panda posted:Does PyCharm have an option to show names of arguments like IntelliJ does? Yes. But I don't remember what the option is called!
|
# ? May 16, 2019 04:34 |
|
Probably a version mismatch between the python in your venv and the version in your container Or, likewise, between your pip versions
|
# ? May 16, 2019 05:26 |
|
shrike82 posted:Err is there a reason why you're building TF manually?
|
# ? May 16, 2019 06:10 |
|
Thermopyle posted:Yes. But I don't remember what the option is called! They're called Parameter Hints but this doesn't seem to work for me. With it enabled, there's still no hints except if your cursor is editing the call with a popup. I'm not sure if this is an oversight or what. What's annoying is that the documentation documents it like it works: https://www.jetbrains.com/help/pycharm/viewing-reference-information.html#parameter-hints But there's nothing for PyCharm. Meanwhile, for C#, Java, Ruby, and so on: https://www.jetbrains.com/help/ruby/viewing-reference-information.html#parameter-hints https://www.jetbrains.com/help/idea/viewing-reference-information.html#parameter-hints https://www.jetbrains.com/help/rider/Inline_Parameter_Name_Hints.html#why-use-parameter-name-hints Though I do think Jetbrains has it confused with the popup version: https://youtrack.jetbrains.com/issue/PY-34026 I'm going to comment on that Jetbrains issue. It's been some time and I'm really not sure why Python doesn't have parity.
|
# ? May 16, 2019 06:36 |
|
General_Failure posted:To build without AVX opcodes, and to support CUDA3.0 I'm no help there but out of curiosity, what's your use-case? Some kind of embedded CV application?
|
# ? May 16, 2019 07:07 |
|
shrike82 posted:I'm no help there but out of curiosity, what's your use-case? Some kind of embedded CV application? Ouch. That's my PC.
|
# ? May 16, 2019 10:01 |
|
My bad... Why not give Google Colab a try? It's free and would let you play with CV frameworks?
|
# ? May 16, 2019 11:11 |
|
crazysim posted:They're called Parameter Hints but this doesn't seem to work for me. Oh yeah, you're right. It does work in Javascript files that you're editing in PyCharm and that's what I was thinking of.
|
# ? May 16, 2019 16:26 |
|
I use logging and have calls to the logging setup with different levels throughout my program. Depending on the level they either get dealt with by 0 or more of file/screenshot/console/webservice. Debug mode is useful for me as the dev, and logging the messages to console is enough for me to try to fix any issue if I'm there at the computer. Some people are however going to run the software to see if it works and report any bugs. The messages that it'd pop up now might be enough, but I'm thinking it'd be more useful to be able to produce much more verbose logs. Is there a good way to log pretty much every statement executed and the variable values, similar to stepping through only your own code in the program? I assume it'll produce excess data, but the people that are going to be running it are the kind of person who reports a bug as "It stopped working" and are not good at verbalising either what they mean or what they were doing. To go with that, what do people use as a ticket submission setup? Thinking of creating something to log the problems they'll hopefully find, as their idea mainly involves sending me a mountain of messages on Discord and that's nothing close to organised.
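One hedged sketch of "log every line plus the local variables" using the stdlib trace hook. This is very slow and very verbose, so you'd only enable it in the bug-report builds described above, and filtering to your own files is essential. The `path_fragment` filter and names here are assumptions.

```python
# Trace each executed line and its locals via sys.settrace, filtered to
# files whose path contains path_fragment.
import sys
import logging

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
log = logging.getLogger("trace")

def make_tracer(path_fragment):
    def tracer(frame, event, arg):
        if event == "line" and path_fragment in frame.f_code.co_filename:
            log.debug("%s:%d locals=%r",
                      frame.f_code.co_filename, frame.f_lineno, frame.f_locals)
        return tracer  # keep tracing nested frames
    return tracer

def run_traced(func, *args, path_fragment="myproject"):
    sys.settrace(make_tracer(path_fragment))
    try:
        return func(*args)
    finally:
        sys.settrace(None)  # always restore normal execution

def buggy(x):
    y = x * 2
    return y

print(run_traced(buggy, 21, path_fragment=""))  # traces every line, returns 42
```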
|
# ? May 17, 2019 07:44 |
|
QuarkJets posted:Tell me more about these antics Aight. So yeah Goonfleet. It wasn't really anything sanctioned by the leadership, just a few of us loving around in private. We basically noticed that the log files were stored in a Python pickle format (once you've seen it in hex, it's hard to miss), so I started creating something that could be used to raise alarms when certain ships entered the system, because the system would log gate activations. From there we worked out their MACHOnet implementation worked much the same way. There was encryption, but it was trivial to decode because the keys were in the client. So you could listen in on the wire protocol, and oh boy was it promiscuous. Turns out when your ship approached an enemy POS, the POS would bark out the shield password hashed, and if your hash matched its hash, you were in. Not sure if it trusted the client; I think it was more that the client and server kept parallel simulations. Anyway, it turns out it wasn't THAT hard to decrypt the hashes on a GPU, since the salts were in plain text on the client. But we refrained from passing the info on to leadership because it was pretty clear that it's the kind of hijinks that could get people in lots of trouble, and we were more just curious than wanting to cheat. If memory serves me right we ended up passing the exploits to the devs. Oh, and at some point someone that wasn't us abused the remote code execution in Python pickles to convince the server to give up its entire source code, which we got passed a copy of. It was..... surprisingly well written, and I got a much better appreciation of the kinds of clusterfucks an MMO as complicated as that presented. There were a few exploits in there (including one that let you make announcements as a GM to a sector, which got abused relentlessly, again not by us, for comical effect until the devs fixed it).
|
# ? May 17, 2019 08:30 |
|
duck monster posted:Oh, and at some point someone that wasnt us abused the remote code execution in python pickles to convinced the server to give up its entire source code I'm going to try and remember that as a concrete example of why pickle is dangerous. Usually I can't ever think of any examples.
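The mechanism behind that kind of exploit fits in a few lines: pickle's `__reduce__` hook lets a payload name any callable to be invoked at load time. The example below is deliberately harmless (it just evaluates an arithmetic expression); a hostile payload would call `os.system` or similar instead.

```python
# Why unpickling untrusted bytes is remote code execution: __reduce__
# returns (callable, args), and pickle.loads runs it on the loader's machine.
import pickle

class Payload:
    def __reduce__(self):
        # a hostile payload would put os.system("...") here
        return (eval, ("6 * 7",))

data = pickle.dumps(Payload())
print(pickle.loads(data))  # 42 -- the expression ran during loads()
```

This is why the `pickle` docs warn to never unpickle data from an untrusted source; there is no "safe mode" that keeps the data but blocks the code.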
|
# ? May 17, 2019 18:59 |
|
duck monster posted:Aight. So yeah Goonfleet. You summoned me with this story. Hi.
|
# ? May 17, 2019 22:15 |
|
So when you say entire source code do you mean the entire game was written in Python, or that the exploit was able to locate the game’s git repository on their production servers and transfer out a tar.gz file of it?
|
# ? May 18, 2019 08:12 |
|
The entire game was written in python, client and server, with occasional interfaces to backend databases or other frameworks (e.g. graphics) as needed.
|
# ? May 18, 2019 12:42 |
|
Sounds like there was some drama at the Python language summit. The general idea: Python is loving up its ecosystem with its poor standard library. I like where Amber Brown is coming from there. In a lot of ways I think that if you're making Guido mad you've got good ideas!
|
# ? May 18, 2019 18:25 |
What's wrong with the standard library? Always thought there's plenty of useful stuff in it?
|
|
# ? May 18, 2019 18:44 |
|
Bundy posted:What's wrong with the standard library? Always thought there's plenty of useful stuff in it? The linked post sums it up pretty neatly I think. The standard library is full of worse versions of community developed things, but having a problem poorly solved in the standard library has a crowding out effect on better solutions.
|
# ? May 18, 2019 18:47 |
|
Bundy posted:What's wrong with the standard library? Always thought there's plenty of useful stuff in it? And it will only get better: http://charlesleifer.com/blog/new-features-planned-for-python-4-0/
|
# ? May 18, 2019 21:02 |
|
Yeah I'm glad the python community is realizing that Guido made some bad choices regarding the stdlib and keeping cpython a slow mess
|
# ? May 18, 2019 21:41 |
|
Lmao at Guido taking it personally and storming out when someone with expertise offers constructive criticism of his baby
|
# ? May 18, 2019 21:42 |
|
|
punished milkman posted:Anyone have any package suggestions for extracting tables of data from image files (.png/.jpg) ? I tried using Tesseract/pytesseract and while it's doing a great job of detecting the text, the tabular aspect of it is totally lost and I couldn't find a straight forward path to processing tables with it. I've used Camelot with PDFs before, and it worked OK (at best), but I'm hoping to use something else this time around. Extortionist posted:This isn't an easy problem. If the images are fairly consistent you can try using one of the tesseract outputs that supplies word coordinates and do your own table determination based on the relative positions of words. It might also be useful to run the images through opencv first to extract the positions of the lines (possibly also removing them from the image, or splitting into several small images prior to OCR). Chiming in to say Docucharm is pretty good at this if they're of a vaguely consistent format. For example, I needed to turn typed and printed->scanned reports into structured data. I did a comparison of them to Textract (which I got access to early) and they were much better. Haven't used either in about 4 months so can't say if either has made large progress. CarForumPoster fucked around with this message at 22:11 on May 18, 2019 |
# ? May 18, 2019 22:09 |