Extortionist
Aug 31, 2001

Leave the gun. Take the cannoli.

KICK BAMA KICK posted:

Just curious, is there a long story short on why tables are hard?
Basically what punished milkman said.

With OCR'd data all you have are the coordinates of the text chunks, the text itself, and possibly some other visual cues like cell borders that you can extract from the image. All of these will be confounded by standard OCR issues--you can never trust that the text will be correct, and if the image isn't clean you can expect that there'll be random visual noise misinterpreted as characters all over the place.

Unless you're able to split along cell borders you find in the image, you have to determine the cell size based only on the coordinates (and possibly contents) of the text. Often you won't have all or any cell borders--e.g., punished milkman's images have horizontal borders but not vertical borders.

Consider that cells can have different widths and heights, and that text can run right up to any edge of a cell, so one cell's text may end exactly one space or line break away from where the next cell's text starts. Consider also that some cells may span multiple columns, some may span multiple rows, and some may do both.

If you're dealing with a lot of well-formatted, standardized tables or something like standardized forms, a common approach is to create templates that basically define where you can expect to find each piece of information. With enough training data, you can build ML models that can support more loosely-structured forms with a similar approach.

If you're dealing with less but still somewhat regularly formatted data, you can build out more specific parsers based on your own domain knowledge.

If you're dealing with arbitrary data, well, good luck.
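
To make the "coordinates only" case concrete, here's a minimal sketch of the first step: grouping OCR'd text chunks into rows using nothing but their bounding boxes. The Word structure, the group_into_rows name, and the tolerance value are all made up for illustration; real OCR output is shaped differently depending on the tool, and none of this handles cells that span rows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One OCR'd text chunk with its bounding box (hypothetical structure)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def group_into_rows(words: List[Word], tolerance: float = 0.5) -> List[List[Word]]:
    """Cluster words into rows by the vertical centre of their boxes.

    A word joins the current row when its centre is within
    tolerance * its own height of that row's running average centre.
    """
    rows: List[List[Word]] = []
    centres: List[float] = []
    for word in sorted(words, key=lambda w: w.y + w.height / 2):
        centre = word.y + word.height / 2
        if rows and abs(centre - centres[-1]) <= tolerance * word.height:
            rows[-1].append(word)
            # Update the running average of this row's centre.
            centres[-1] += (centre - centres[-1]) / len(rows[-1])
        else:
            rows.append([word])
            centres.append(centre)
    # Order each row left to right.
    return [sorted(row, key=lambda w: w.x) for row in rows]
```

Columns are the harder half: you'd do something similar on the x coordinates, and that's exactly where touching or spanning cells start to bite.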

QuarkJets
Sep 8, 2008

dougdrums posted:

Hahaha, ok. It certainly does matter.

I already addressed the rest of what you said. Nobody I know personally has felt the need to express how my use of a dict for a switch is morally corrupt, and I find it useful so I'm gonna keep doing it. I also provided an alternative example with several concrete arguments for why I find it preferable to using else statements.

And there's the slightest chance I was being hyperbolic in my original post, for the sake of internet posting.

E: I mean like goddamn how did you read (and quote) what I posted and then decide to write that out. I had to go back and reread them just to make sure I actually wrote what I thought. Use else statements if you want, I'm gonna stop making GBS threads things up with my heresy.

Everyone is accepting of your use of a dict as a switch statement, it's fine; I was just explaining why some people don't hold themselves to the standard you've created. Don't try to flip the script; you're the one who said that no one should ever use elif, no one cares if you choose not to use them.

And there must be some miscommunication here because really, no, if it costs no additional effort then it's fine to want to write code that runs quickly the first time

QuarkJets fucked around with this message at 06:52 on May 11, 2019

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

mr_package posted:

How are you organizing your @dataclass objects? If each instance is essentially a row from a database, is there a good way to order/organize them? e.g. if I just shove everything into a List I need to iterate through to try and find the specific instance I'm looking for. There's probably a smarter way?
code:
from dataclasses import dataclass

@dataclass
class Language:
    id: int
    code: str
    name: str
Would you remove the id parameter from the Language class and use it as a dictionary key instead? e.g. languages = {1: Language('en-US', 'English (US)'), 2: Language('fr-FR', 'French')} This feels a bit weird to me because I expect the id to be part of the Language object, but maybe this is part of data modeling: should I write a Languages class that is a collection of Language objects and implements some friendly methods to find them?

If you think about it though, the primary key in a database row is part of the organisational system, so it makes sense to have it as the key in the dictionary. It's what you're using to identify and reference a particular data object

And sometimes that primary key is also part of the data object, like an ISBN, instead of just an internal identifier the database is using. So in that case, it makes sense to also have it in the object you're storing in the dictionary. Chances are the thing that ends up accessing it will want to make use of that attribute, but won't necessarily have access to the dictionary. And maybe you might want to change that implementation anyway - but the data should stay the same, right? Store what you need to store

So yeah you're storing the same piece of data twice, but imo that's ok, it makes sense sometimes! You'll probably want to write something that enforces that consistency - maybe a class that holds the dictionary internally, and adds stuff by automatically reading the key from your data and using that to add an item, so you never interact with the dict directly. It depends if end users care about the ID or not - I'm guessing they supply it for lookups?

(If you really care about this redundancy you could have a fetch method that adds the key to the returned data, so you store it separately but the user gets the whole thing)

Also don't forget if the IDs are in sequential order you can just index into a list instead, if that works for you
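
Here's a rough sketch of the "class that holds the dictionary internally" idea, assuming the id stays on the object. The Languages class and its method names are invented for the example, and I've made the dataclass frozen so the id can't change after it has been used as a key.

```python
from dataclasses import dataclass
from typing import Dict, Iterator


@dataclass(frozen=True)
class Language:
    id: int
    code: str
    name: str


class Languages:
    """Collection that keys each Language by its own id (hypothetical API)."""

    def __init__(self) -> None:
        self._by_id: Dict[int, Language] = {}

    def add(self, language: Language) -> None:
        # The key is always read from the object, so the two copies cannot drift apart.
        if language.id in self._by_id:
            raise ValueError(f"duplicate id {language.id}")
        self._by_id[language.id] = language

    def get(self, language_id: int) -> Language:
        return self._by_id[language_id]

    def __iter__(self) -> Iterator[Language]:
        return iter(self._by_id.values())

    def __len__(self) -> int:
        return len(self._by_id)


languages = Languages()
languages.add(Language(1, "en-US", "English (US)"))
languages.add(Language(2, "fr-FR", "French"))
print(languages.get(2).name)  # French
```

Nothing outside the class touches the dict, so you can swap it for a list or a real database later without changing callers.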

Boris Galerkin
Dec 17, 2011

I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!
Original post deleted.

Edit:

For that OCR thing, someone posted a link to Google’s web API where they have a way you can test it out with just uploading a picture.

https://cloud.google.com/vision/

I tried it out with that nutrition picture and here’s one thing it considers as one “text block”

quote:

7 % S u g a r s / S u c r e s 1 4 g P r o t e i n / P r o t é i n e s 2 9 C h o l e s t e r o l / C h o l e s t é r o l 0 m g S o d i u m 5 m g P o t a s s i u m 2 0 0 m g C a l c i u m 2 0 m g I r o n / F e r 0 . 5 m g V i t a m i n A , V i t a m i n e A 1 0 m o g 1 V i t a m i n C V i t a m i n e C 1 4 m g 1 6 % T h i a m i n e 0 . 0 5 m g

And it looks like the first 7% is offset and belongs to the line above it. Anyway my take on this is that if this is the best that Google can do then there probably isn’t a better solution? You’d have to invest some time sanitizing and reformatting the text it seems.

Boris Galerkin fucked around with this message at 08:41 on May 11, 2019

dougdrums
Feb 25, 2005
CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)

punished milkman posted:

I think splitting the images up into sections with OpenCV and then extracting/parsing the text is what I'll need to do. This is way more involved than I thought... loving tables
Did you try running it through Azure's OCR? I had a bunch of cookbooks that I wanted to digitize, and was able to do it with Azure using the free credit. There were some I had to fix by hand, like where a recipe would specify a measurement with just a word (like 'pinch'), but I was able to do about 90% or so using Azure and Lark parser. They actually use a nutrition label as a demo input too.

I put that example through it and got this as output:
code:
Nutrition Facts

Valeur nutritive

Per 1 cup (122 g)

pour 1 tasse (122 9)

Calories 140

"% Dally Value "

% valeur quotidienne'

Fat / Lipides 8 g

11%

Saturated / satures 3 g

+ Trans / trans 0 g

15 %

Carbohydrate / Glucides 19 g

Fibre / Fibres 2 g

7%

Sugars / Sucres 14 g

14%

Protein / Proteines 2 g

Cholesterol / Cholesterol 0 mg

Sodium 5 mg

1%

Potassium 200 mg

4%

Calcium 20 mg

2%

Iron / Fer 0.5 mg

3 %

Vitamin A/ Vitamine A 10 mcg

Vitamin C / Vitamine C 14 mg

16 %

Thiamine 0.05 mg

4 9

Riboflavin / Riboflavine 0.05 mg

4%

'5% or less is a little, 15% or more is a lot

5% ou moins c'est peu, 15% ou plus c'est beaucoup
https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/

It'd probably be easier to write a parser sketched out like ((text label) (decimal number) ([choice of unit]) newline)+ ((digits+) % newline)) per row, with a bit of bespoke edits to accept "m9" as "mg" and deal with the inconsistent whitespace. You'd have to do some sanity checking to catch the parts where you could possibly have 0.05mg of Thiamine being 49% of the DV. But you could also just totally sidestep that issue by calculating the unit measurements to percent DV from some source, and outright ignore the given DV values.
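
Roughly what I mean, as a sketch. This uses plain regexes from the standard library instead of Lark so it stays self-contained, and the names (parse_label, UNIT_FIXES) are made up for the example. It assumes one item per line like the output above, treats a bare "9" after the amount as a misread "g", and attaches a percent line to the nearest preceding nutrient that doesn't have one yet, which is a simplification.

```python
import re
from typing import Dict, List

# (text label) (decimal number) (unit); "m9" and a bare "9" are common misreads.
NUTRIENT = re.compile(
    r"^(?P<label>[^\d%]+?)\s*(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|m9|mcg|g|9)$"
)
# (digits) % on a line of its own, with optional whitespace before the sign.
PERCENT = re.compile(r"^(?P<percent>\d+)\s*%$")
UNIT_FIXES = {"m9": "mg", "9": "g"}


def parse_label(text: str) -> List[Dict[str, object]]:
    """Parse OCR'd nutrition lines into rows; unrecognised lines are skipped."""
    rows: List[Dict[str, object]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = NUTRIENT.match(line)
        if match:
            unit = match.group("unit")
            rows.append({
                "label": match.group("label").strip(" +"),
                "amount": float(match.group("amount")),
                "unit": UNIT_FIXES.get(unit, unit),
                "percent_dv": None,
            })
            continue
        match = PERCENT.match(line)
        if match and rows and rows[-1]["percent_dv"] is None:
            rows[-1]["percent_dv"] = int(match.group("percent"))
    return rows
```

A line like "4 9" (a misread "4%") matches neither pattern, so Thiamine just ends up with no percentage rather than a wrong one. That's the kind of gap you'd fill by computing the DV from the amounts.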

Azure also gives bounding boxes, but for my recipe project I just used them to ensure that the whole recipe was being scanned.

dougdrums fucked around with this message at 11:45 on May 11, 2019

dougdrums
Feb 25, 2005
CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)
Whoops ...

Dominoes
Sep 20, 2007

Consider using enums in place of booleans. This article is targeted at Rust, but is relevant to Python as well.

The article explains it well, but enums can be thought of as a generalization of booleans that (1) can have more than two values (useful even when your situation only includes two, in case it expands later) and (2) add a clear semantic meaning.

Caveats: Python's enums may not be performance-equivalent to booleans, Python doesn't include exhaustive pattern matching (see the above discussion about switch statements, if-else, etc.), and the syntax for creation is slightly more verbose. Despite these, I think it's worth considering an enum whenever your instinct points to a boolean.
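
A small sketch of the idea in Python. The fetch function and the OnTimeout enum are hypothetical; the point is only how the call site reads.

```python
from enum import Enum


class OnTimeout(Enum):
    RETRY = "retry"
    FAIL = "fail"


def fetch(url: str, on_timeout: OnTimeout = OnTimeout.FAIL) -> str:
    """Hypothetical function; only the call signature matters here."""
    if on_timeout is OnTimeout.RETRY:
        return f"would retry {url}"
    if on_timeout is OnTimeout.FAIL:
        return f"would give up on {url}"
    # No exhaustiveness checking in Python, so guard against members added later.
    raise ValueError(f"unhandled option: {on_timeout!r}")


# Compare fetch(url, True) with:
print(fetch("https://example.com", OnTimeout.RETRY))
```

The final raise is the manual stand-in for the exhaustive matching Rust gives you for free.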

Dominoes fucked around with this message at 15:42 on May 11, 2019

Dominoes
Sep 20, 2007

edit: delete

fourwood
Sep 9, 2001

Damn I'll bring them to their knees.
I’m skeptical, although the match behavior in Rust does add an extra compelling reason. To me a) that doesn’t necessarily represent a typical truthiness state (there are plenty of times you’ll never conceive of adding a 3rd state because it either is or isn’t), and b) you’d probably want to be passing in some ‘hasTimedOut’ bool variable instead of a boolean literal, right?

mr_package
Jun 13, 2000

baka kaba posted:

So yeah you're storing the same piece of data twice, but imo that's ok, it makes sense sometimes!
Yeah especially for a PK that is a small int, I mean most tables are going to be fewer than 10 rows, so redundancy is small, it just looks weird to duplicate the keys/ids with my DBA hat on. (Not a DB expert by any means but I learned DBs before I learned any programming so it's really hard to view data in any other way).

baka kaba posted:

It depends if end users care about the ID or not - I'm guessing they supply it for lookups?
I think it will be hidden, this thing will probably be a web app and use the ids for internal CRUD ops but probably never show it to end users.

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

Well by "end user" I really meant whatever's calling it (should have said client). So say your web app goes "hey I need that list of languages" and your Python code goes "here ya go, each one has an ID". If the web app is doing something like displaying all the languages, letting the user pick one, and then passing the relevant ID back to your Python code, that ID is sort of temporary - it's like getting a menu of options and going "yeah, the third one", once you get the thing you wanted you don't really need to remember what its position was anymore, right?

Whereas if you're gonna need to reference that ID again later somewhere, it makes sense to bundle it up as an attribute of the Language, since it's more of a persistent reference and the way you actually specify which object you want - it's acting more like a database, the primary key is part of the data (even if the user doesn't ever see it). It depends on exactly what you're doing, but by the sound of it the ID is probably gonna be a persistent reference, right?

If the duplication bothers you, imagine you have Languages with an ID field, and you're storing them in a list. That's basically a database table. You can search it just fine, looking at the ID field, but for speed in a database you might want to build an index, which is basically another table that holds the key (so you now have a copy in two places) and a pointer to the data in the first table. That combo of index and data table is sort of what a dictionary is, so it's not that weird to have the key present in the lookup and the data itself!

Like I said you could separate the key out for storage, and recombine it with the data when it's retrieved, but that's extra complexity that needs to be worth it

baka kaba fucked around with this message at 19:24 on May 11, 2019

Hollow Talk
Feb 2, 2014

baka kaba posted:

If the duplication bothers you, imagine you have Languages with an ID field, and you're storing them in a list. That's basically a database table. You can search it just fine, looking at the ID field, but for speed in a database you might want to build an index, which is basically another table that holds the key (so you now have a copy in two places) and a pointer to the data in the first table. That combo of index and data table is sort of what a dictionary is, so it's not that weird to have the key present in the lookup and the data itself!

I do something a bit like this in one of my work projects where I need a system of metadata that maps source data (from different sources) to target tables (there can be an arbitrarily large number of target tables per source). For simple lookups, I use the source as a dictionary key, and the actual metadata sits in NamedTuples, which, among other things, has an entry "source_table", which is the same as the dictionary key.

Since I also use this metadata for orchestration (think for entry in metadata.keys()), this allows direct lookups (metadata[source_table]) as well as loops, and I still have all data available later for additional steps (I need the source table for logging, for example). This also allows me to write more functional code, since I don't need the whole dictionary or its key(s) while looping, only its value(s), since they contain everything I need.
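
A stripped-down sketch of the shape, with made-up table names and a hypothetical TableMeta tuple (the real thing has more fields):

```python
from typing import Dict, List, NamedTuple


class TableMeta(NamedTuple):
    source_table: str          # same value as the dictionary key
    target_tables: List[str]


def build_metadata(entries: List[TableMeta]) -> Dict[str, TableMeta]:
    """Key each entry by the source_table it already carries."""
    return {entry.source_table: entry for entry in entries}


def describe(entry: TableMeta) -> str:
    # Needs only the value: the key travels with it.
    return f"{entry.source_table} -> {', '.join(entry.target_tables)}"


metadata = build_metadata([
    TableMeta("crm.customers", ["dim_customer", "fact_orders"]),
    TableMeta("erp.invoices", ["fact_invoices"]),
])

# Direct lookup by key ...
print(describe(metadata["erp.invoices"]))
# ... or orchestration over the values alone.
for line in map(describe, metadata.values()):
    print(line)
```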

mr_package
Jun 13, 2000
In the end, because JSON object keys have to be strings and so can't be numeric ids, I ended up writing the database export in a list-of-dicts format. However, I also kept the primary key and wrote a simple function that reads the export back in and converts it to a dictionary of dictionaries where the id is the key, e.g. {1: {"id": 1, "code": "en-US"}, 2: {"id": 2, "code": "fr-FR"}}

If I find I'm not actually using it I may just leave it as a list after all, will depend how the rest of the code shakes out and whether it needs/uses it. I haven't worked in this pattern before so not sure if I'm actually gonna need the ids outside of the dict itself.
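
For anyone following along, the read-back step looks something like this. It's a sketch: index_by_id is an illustrative name and the rows are the two from the example above.

```python
import json
from typing import Dict, List


def index_by_id(rows: List[dict]) -> Dict[int, dict]:
    """Convert a list of row dicts into {id: row}, keeping the id inside each row."""
    indexed = {row["id"]: row for row in rows}
    if len(indexed) != len(rows):
        raise ValueError("duplicate ids in export")
    return indexed


exported = json.dumps([
    {"id": 1, "code": "en-US"},
    {"id": 2, "code": "fr-FR"},
])
languages = index_by_id(json.loads(exported))
print(languages[2]["code"])  # fr-FR

# Why the list format: JSON object keys are always strings, so int keys do not survive.
print(json.loads(json.dumps({1: "en-US"})))  # {'1': 'en-US'}
```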

the yeti
Mar 29, 2008

memento disco



What’s the best way to avoid headaches with write access to the Program Files folders when installing packages on Windows while also not doing anything user specific?

I think I can make a site-packages folder at an arbitrary location and add it to PYTHONPATH for running code, but can I also get pip to use it transparently?

Sad Panda
Sep 22, 2004

I'm a Sad Panda.
Does PyCharm have an option to show names of arguments like IntelliJ does?

In this picture, 'id' and 'name' are added automatically.

General_Failure
Apr 17, 2005
I know I'm spreading myself around threads a bit, but this is a Python problem I think.

I built TensorFlow 1.13 in a docker container. I'm pretty sure it was using the correct CPU type for the build. It built, and was packaged. I tried installing it in a venv.

code:
pip install ./tensorflow-1.13.1-cp36-cp36m-linux_x86_64.whl 
ERROR: tensorflow-1.13.1-cp36-cp36m-linux_x86_64.whl is not a supported wheel on this platform.
what.

What does it want from me?

necrotic
Aug 2, 2005
I owe my brother big time for this!
Can you just use the official docker container as your base? https://www.tensorflow.org/install/docker

General_Failure
Apr 17, 2005

necrotic posted:

Can you just use the official docker container as your base? https://www.tensorflow.org/install/docker

A little over halfway down.
https://www.tensorflow.org/install/source
I think I did.

Right now I seem to be building a CPU only Tensorflow using a Python 3 virtualenv I put together today. I just can't do the CUDA one. There's a breakage with no sufficient answers online :( Next, many hours of waiting.

While I'm here, does ImageAI actually work properly? Or does it just hate me?

shrike82
Jun 11, 2005

Err is there a reason why you're building TF manually? There are official TF images on dockerhub for almost every combination of Python, CPU/GPU you'd be interested in.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Sad Panda posted:

Does PyCharm have an option to show names of arguments like IntelliJ does?

In this picture, 'id' and 'name' are added automatically.



Yes. But I don't remember what the option is called!

QuarkJets
Sep 8, 2008

Probably a version mismatch between the python in your venv and the version in your container

Or, likewise, between your pip versions
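
A quick way to check the Python side of that, run with the venv's interpreter. This is a sketch using only the standard library: it assumes CPython and a wheel with a single tag triple, and the helper names are made up. It won't tell you anything about what your pip version considers a supported platform tag.

```python
import sys
from typing import Tuple


def wheel_tags(filename: str) -> Tuple[str, str, str]:
    """Split a wheel filename into (python tag, abi tag, platform tag)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    # name-version[-build]-python-abi-platform: the last three fields are the tags.
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def interpreter_tag() -> str:
    """CPython-style tag for the running interpreter, e.g. cp36 for Python 3.6."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


wheel = "tensorflow-1.13.1-cp36-cp36m-linux_x86_64.whl"
python_tag, abi_tag, platform_tag = wheel_tags(wheel)
print("wheel built for:", python_tag, abi_tag, platform_tag)
print("running interpreter:", interpreter_tag())
if python_tag != interpreter_tag():
    print("Python version mismatch: pip will most likely refuse this wheel")
```

If the venv turns out to be anything other than Python 3.6, that would explain the error.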

General_Failure
Apr 17, 2005

shrike82 posted:

Err is there a reason why you're building TF manually?
To build without AVX opcodes, and to support CUDA3.0

crazysim
May 23, 2004
I AM SOOOOO GAY

Thermopyle posted:

Yes. But I don't remember what the option is called!

They're called Parameter Hints but this doesn't seem to work for me. With it enabled, there are still no hints, except in a popup while your cursor is editing the call. I'm not sure if this is an oversight or what.

What's annoying is that the documentation documents it like it works:

https://www.jetbrains.com/help/pycharm/viewing-reference-information.html#parameter-hints

But there's nothing for PyCharm. Meanwhile, for C#, Java, Ruby, and so on:

https://www.jetbrains.com/help/ruby/viewing-reference-information.html#parameter-hints

https://www.jetbrains.com/help/idea/viewing-reference-information.html#parameter-hints

https://www.jetbrains.com/help/rider/Inline_Parameter_Name_Hints.html#why-use-parameter-name-hints


Though I do think JetBrains has it confused with the popup version: https://youtrack.jetbrains.com/issue/PY-34026

I'm going to comment on that JetBrains issue. It's been some time and I'm really not sure why Python doesn't have parity.

shrike82
Jun 11, 2005

General_Failure posted:

To build without AVX opcodes, and to support CUDA3.0

I'm no help there but out of curiosity, what's your use-case? Some kind of embedded CV application?

General_Failure
Apr 17, 2005

shrike82 posted:

I'm no help there but out of curiosity, what's your use-case? Some kind of embedded CV application?

Ouch :( That's my PC.

shrike82
Jun 11, 2005

My bad... Why not give Google Colab a try? It's free and would let you play with CV frameworks?

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

crazysim posted:

They're called Parameter Hints but this doesn't seem to work for me.

Oh yeah, you're right. It does work in Javascript files that you're editing in PyCharm and that's what I was thinking of.

Sad Panda
Sep 22, 2004

I'm a Sad Panda.
I use logging and have calls to the logging setup with different levels throughout my program. Depending on the level they either get dealt with by 0 or more of file/screenshot/console/webservice. Debug mode is useful for me as the dev and logging the messages to console are enough for me to try to fix any issue if I'm there at the computer.

Some people are however going to run the software to see if it works and report any bugs. The messages that it'd pop up now might be enough, but I'm thinking it'd be more useful to be able to produce much more verbose logs. Is there a good way to log pretty much all commands run and variables similar to as if you were stepping through only your code in the program? I assume it'll produce excess data, but the people that are going to be running it are the kind of person who reports a bug as "It stopped working" and are not good at verbalising either what they mean or what they were doing.

To go with that, what do people use as a ticket submission setup? Thinking of creating something to log the problems they'll hopefully find as their idea mainly involves sending me a mountain of messages on Discord and that's nothing close to organised.

duck monster
Dec 15, 2004

QuarkJets posted:

Tell me more about these antics

Aight. So yeah Goonfleet. It wasn't really anything sanctioned by the leadership, just a few of us loving around in private. We basically noticed that the log files were stored in a Python pickle format (once you've seen it in hex, it's hard to miss), so I started creating something that could be used to raise alarms when certain ships entered the system, because the system would log gate activations. From there we worked out that their MACHOnet implementation worked much the same way. There was encryption, but it was trivial to decode because the keys were in the client. So you could listen in on the wire protocol, and oh boy was it promiscuous. Turns out when your ship approached an enemy POS, the POS would bark out the hashed shield password, and if your hash matched its hash, you were in. Not sure if it trusted the client, I think it was more that the client and server kept parallel simulations. Anyway it turns out it wasn't THAT hard to crack the hashes on a GPU, since the salts were in plain text on the client. But we refrained from passing the info on to leadership because it was pretty clear that it's the kind of hijinx that could get people in lots of trouble, and we were more just curious than wanting to cheat. If memory serves me right we ended up passing the exploits to the devs.

Oh, and at some point someone that wasn't us abused the remote code execution in Python pickles to convince the server to give up its entire source code, which we got passed a copy of. It was..... surprisingly well written and I got a much better appreciation of the kinds of clusterfucks an MMO as complicated as that presented. There were a few exploits in there (including one that let you make announcements as GM to a sector, which got abused relentlessly, again not by us, for comical effect until the devs fixed it).

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

duck monster posted:

Oh, and at some point someone that wasn't us abused the remote code execution in Python pickles to convince the server to give up its entire source code

I'm going to try and remember that as a concrete example of why pickle is dangerous. Usually I can't ever think of any examples.
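
For a version that fits in a post, this is the general mechanism (a harmless sketch, not whatever was actually done to that server): an object can tell pickle which callable to invoke when it is loaded. The Payload class is made up, and eval on a bit of arithmetic stands in for something nastier.

```python
import pickle


class Payload:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # An attacker would name something like os.system here instead.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())

# Loading runs the callable; the receiver never needs the Payload class.
result = pickle.loads(data)
print(result)  # 42
```

So anything that unpickles bytes it didn't create itself is running code chosen by whoever made those bytes.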

pmchem
Jan 22, 2010


duck monster posted:

Aight. So yeah Goonfleet.

You summoned me with this story. Hi.

Boris Galerkin
Dec 17, 2011

I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!
So when you say entire source code do you mean the entire game was written in Python, or that the exploit was able to locate the game’s git repository on their production servers and transfer out a tar.gz file of it?

pmchem
Jan 22, 2010


The entire game was written in python, client and server, with occasional interfaces to backend databases or other frameworks (e.g. graphics) as needed.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Sounds like there was some drama at the Python language summit.

The general idea:

Python is loving up its ecosystem with its poor standard library.

I like where Amber Brown is coming from there.

In a lot of ways I think that if you're making Guido mad you've got good ideas!

NinpoEspiritoSanto
Oct 22, 2013




What's wrong with the standard library? Always thought there's plenty of useful stuff in it?

Nippashish
Nov 2, 2005

Let me see you dance!

Bundy posted:

What's wrong with the standard library? Always thought there's plenty of useful stuff in it?

The linked post sums it up pretty neatly I think. The standard library is full of worse versions of community developed things, but having a problem poorly solved in the standard library has a crowding out effect on better solutions.

Hollow Talk
Feb 2, 2014

Bundy posted:

What's wrong with the standard library? Always thought there's plenty of useful stuff in it?

And it will only get better: http://charlesleifer.com/blog/new-features-planned-for-python-4-0/

Malcolm XML
Aug 8, 2009

I always knew it would end like this.
Yeah I'm glad the python community is realizing that Guido made some bad choices regarding the stdlib and keeping cpython a slow mess

Malcolm XML
Aug 8, 2009

I always knew it would end like this.
Lmao at Guido taking it personally when someone with expertise makes constructive criticism about his baby and he storms out

CarForumPoster
Jun 26, 2013

⚡POWER⚡

punished milkman posted:

Anyone have any package suggestions for extracting tables of data from image files (.png/.jpg)? I tried using Tesseract/pytesseract and while it's doing a great job of detecting the text, the tabular aspect of it is totally lost and I couldn't find a straightforward path to processing tables with it. I've used Camelot with PDFs before, and it worked OK (at best), but I'm hoping to use something else this time around.

Extortionist posted:

This isn't an easy problem. If the images are fairly consistent you can try using one of the tesseract outputs that supplies word coordinates and do your own table determination based on the relative positions of words. It might also be useful to run the images through opencv first to extract the positions of the lines (possibly also removing them from the image, or splitting into several small images prior to OCR).

You might look at AWS Textract (still in preview) or the Google/Azure OCR services, too, if someone's paying for it.

Chiming in to say Docucharm is pretty good at this if they're of a vaguely consistent format. For example, I needed to turn typed, printed, and then scanned reports into structured data. I compared Docucharm to Textract (which I got access to early) and Docucharm was much better. Haven't used either in about 4 months, so I can't say if either has made large progress.

CarForumPoster fucked around with this message at 22:11 on May 18, 2019
