Python

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python

Nippashish: Nov 2, 2005; Let me see you dance!

Comment before the if is about "why is there a branch here?" Comment inside the if is about "what is the purpose of this path?"
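A minimal sketch of that split, using a made-up `apply_discount` function. The name, the member rule, and the 10% rate are all hypothetical and only exist to give the two kinds of comment something to describe:

```python
def apply_discount(total: float, is_member: bool) -> float:
    """Return the amount due after any discount (hypothetical example)."""
    # Why is there a branch here? Members are billed under different
    # rules than walk-in customers, so the two cases cannot share a path.
    if is_member:
        # What is the purpose of this path? Members get a flat 10% off.
        return round(total * 0.90, 2)
    # What is the purpose of this path? Everyone else pays full price.
    return round(total, 2)
```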

# ? May 9, 2019 21:35

Data Graham: Dec 28, 2009; 📈📊🍪😋

Thermopyle posted:

I prefer the comment before the if block.

However, this does not make any sense and I do not know why I prefer it.

I mean, you don't put a docstring before the function or class definition starts.

This bothers me because every time I write such a comment, it's weighing on me to choose between what I prefer and what seems technically correct.

Just think of a comment as a freestyle decorator

# ? May 9, 2019 21:49

mbt: Aug 13, 2012

I use both and treat under the if like a docstring and above the if like a strong warning to whichever poor soul is using my code

# ? May 9, 2019 21:56

mr_package: Jun 13, 2000

When working with relational data, do you keep it in relational format or just flatten it all? I'm pulling a bunch of stuff out of a database and the tables are all normalized correctly, and I'm not sure whether to keep the small id:text tables. So for example, each item has a "language_code" parameter, in the database it might be 1/2/3 for en-US/fr-FR/de-DE but I could easily pull that out and have it be "en-US" instead of 1 and not even maintain this separate table at all. Advantages of being more human readable might outweigh the bloat and slowness of trying to update all 5k items if we changed to "en_us" or something.

I'm used to working with databases, so normalized data is totally standard to me but maybe in Python / JSON this becomes an antipattern? Anyone ever dealt with this one way or the other and then regretted it?

# ? May 9, 2019 22:18

pmchem: Jan 22, 2010

I mean, I can imagine settling all style arguments with Black.
https://github.com/python/black

It's being used to format core python code now and is part of PSF.

# ? May 9, 2019 23:23

dougdrums: Feb 25, 2005; CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)

I'd say before the if and never use elif.

Before the if because it's clear that the comment expresses something about the comparison as a whole, if that makes sense.

Never use elif because you should return or continue instead; or use a dict with (nested) functions if you want to switch on a value.

E: oh jeez I'm a page too slow.

# ? May 10, 2019 01:08

Thermopyle: Jul 1, 2003; ...the stupid are cocksure while the intelligent are full of doubt. -Bertrand Russell

pmchem posted:

I mean, I can imagine settling all style arguments with Black.
https://github.com/python/black

It's being used to format core python code now and is part of PSF.

I use Black everywhere I can (aka, where there are not other styling conventions).

But I don't think it has a stance on this particular styling thingamajig.

# ? May 10, 2019 01:26

Dr Subterfuge: Aug 31, 2005; TIME TO ROC N' ROLL

dougdrums posted:

I'd say before the if and never use elif.

Before the if because it's clear that the comment expresses something about the comparison as a whole, if that makes sense.

Never use elif because you should return or continue instead; or use a dict with (nested) functions if you want to switch on a value.

E: oh jeez I'm a page too slow.

What does switching like that on a dict look like in practice?

# ? May 10, 2019 01:47

dougdrums: Feb 25, 2005; CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)

I'm phonepostin' and otherwise preoccupied so to speak, so this isn't really a useful example but something like:

Python code:


def paint_color(color : Color):
  return paint({
    Color.RED: '#ff0000',
    Color.GREEN: '#00ff00',
    Color.BLUE: '#0000ff'
  }.get(color, '#000000'))

Or maybe:

Python code:


def route(page : str, *args):
  return {
    'index': index_view,
    'about': about_view,
    'contact': contact_view,
  }.get(page, not_found_view)(*args)

You can get really wacky with this and functools or something but that's the basic idea.

Oh and if you want to match on regex or an expression, you can filter with a comprehension over the dict then reduce.
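As a rough sketch of the regex case: the table below maps pattern strings to handlers, and a generator expression filters it down to the first match. It uses `next()` with a default rather than a reduce, which is a swapped-in choice, and the view functions and URL patterns are hypothetical stand-ins:

```python
import re

def index_view(path):
    return "index"

def user_view(path):
    # Hypothetical handler: echo the trailing id back.
    return "user " + path.rsplit("/", 1)[-1]

def not_found_view(path):
    return "404"

# Hypothetical routing table: regex pattern -> handler.
ROUTES = {
    r"^/$": index_view,
    r"^/users/\d+$": user_view,
}

def route(path):
    # Filter the dict with a generator expression and take the first hit;
    # next() falls back to not_found_view when nothing matches.
    handler = next(
        (view for pattern, view in ROUTES.items() if re.match(pattern, path)),
        not_found_view,
    )
    return handler(path)
```

Because dicts keep insertion order, the first pattern listed wins when more than one matches.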

Else Statement Considered Harmful

dougdrums fucked around with this message at 02:41 on May 10, 2019

# ? May 10, 2019 02:14

susan b buffering: Nov 14, 2016

mr_package posted:

When working with relational data, do you keep it in relational format or just flatten it all? I'm pulling a bunch of stuff out of a database and the tables are all normalized correctly, and I'm not sure whether to keep the small id:text tables. So for example, each item has a "language_code" parameter, in the database it might be 1/2/3 for en-US/fr-FR/de-DE but I could easily pull that out and have it be "en-US" instead of 1 and not even maintain this separate table at all. Advantages of being more human readable might outweigh the bloat and slowness of trying to update all 5k items if we changed to "en_us" or something.

I'm used to working with databases, so normalized data is totally standard to me but maybe in Python / JSON this becomes an antipattern? Anyone ever dealt with this one way or the other and then regretted it?

I tend to keep things relational by having my DB models represented as classes with relevant methods for accessing related objects.
So in your case, I'd have something like:

code:

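class Text(object):
    def lang(self):
        # Assuming a sqlite3-style connection: execute() returns a cursor,
        # and parameters are passed as a sequence (hence the 1-tuple).
        cursor = self.conn().execute("SELECT * FROM lang WHERE id = ?", (self.lang_id,))
        return cursor.fetchone()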

This way you can have the human readable value without de-normalizing the database.

You can even add a @property decorator above the method so it can be accessed as Text().lang.
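A small runnable sketch of the `@property` idea, assuming an in-memory SQLite database. The `lang` table layout, the `code` column, and passing the connection into the constructor are assumptions made for the sake of a self-contained example:

```python
import sqlite3

class Text:
    def __init__(self, conn, lang_id):
        self._conn = conn
        self.lang_id = lang_id

    @property
    def lang(self):
        # Look the code up on access, so the table stays normalized.
        row = self._conn.execute(
            "SELECT code FROM lang WHERE id = ?", (self.lang_id,)
        ).fetchone()
        return row[0] if row else None

# Hypothetical schema and data, just enough to exercise the property.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lang (id INTEGER PRIMARY KEY, code TEXT)")
conn.executemany(
    "INSERT INTO lang (id, code) VALUES (?, ?)",
    [(1, "en-US"), (2, "fr-FR"), (3, "de-DE")],
)

print(Text(conn, 2).lang)  # attribute access, no parentheses
```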

I would recommend looking into ORM and datamapper design patterns for more ideas.

# ? May 10, 2019 02:44

mr_package: Jun 13, 2000

skull mask mcgee posted:

This way you can have the human readable value without de-normalizing the database.

Thanks that is something I was thinking about too-- keep the schema as-is and then just try to provide a clean interface (if only to myself) to it.

One wrinkle is I'm basically migrating from a database to JSON (seems friendliest serialization format). But your approach would still work, it would just be reading it out of a dictionary (which is itself parsed JSON) instead of directly querying the db. I'm essentially doing SQL export to JSON, and trying to decide on what the schema/design of what that JSON should be.

I am not sure how much 'first normal form' style of data modeling to maintain in this case. The DBA in me says 'don't throw away this data / schema, it's useful and correct' but pragmatically I look at this and say 'actually YAGNI, just write it out as a config and forget the old database'.

I suppose the fundamental question is: if you're writing an app that is using a kind of medium-small data set (20k records or so?) but not using a database backend or even SQLite, would you still always always always model that data as relational? Or would you kind of cheat and make it a slightly bloated JSON config file and not worry about it too much? Is there a rule I just don't know about, like "No, it doesn't matter, you don't need a database so just use whatever format is most readable" or maybe "Yes for the love of god keep the data normalized it will save you so much pain in six months when you need to add another platform/target /os".

# ? May 10, 2019 05:43

Boris Galerkin: Dec 17, 2011; I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!

fourwood posted:

I think I prefer a comment before the if but then after the else.... I'm a monster.

I do this too which is why I asked! Comment above the if makes sense to me, but comment above the elif (or the else) looks misplaced so I bring it into the elif/else block. But then I lose 4 characters

pmchem posted:

I mean, I can imagine settling all style arguments with Black.
https://github.com/python/black

It's being used to format core python code now and is part of PSF.

Black is great and I use it when I can, but yeah it doesn't come into the equation for what I'm asking (and for good reason? I'm fine with code formatters deciding my comment was too long and making it into multiple lines... but it probably shouldn't touch where I put the comment).

dougdrums posted:

I'd say before the if and never use elif.

Before the if because it's clear that the comment expresses something about the comparison as a whole, if that makes sense.

Never use elif because you should return or continue instead; or use a dict with (nested) functions if you want to switch on a value.

E: oh jeez I'm a page too slow.

I guess I was talking more about 1:1 math functions, like the one I stole from here:

It normally ends up programmed like so:

code:

if x < 2:
    y = x ** 2
elif x == 2:
    y = 6
elif x <= 6:
    y = 10 - x
else:
    raise ValueError('x is only defined in (-inf, 6]')

This is the simplest and most straightforward way to implement that function, one that anyone could understand, with absolutely no chance of confusion. That's arguably more important than being idiomatic, at least when it comes to math related things. (Edit: technically you would never want to check for x == 2 either, but that's beside the point.)

Boris Galerkin fucked around with this message at 07:23 on May 10, 2019

# ? May 10, 2019 05:51

QuarkJets: Sep 8, 2008

Nippashish posted:

Comment before the if is about "why is there a branch here?" Comment inside the if is about "what is the purpose of this path?"

Yeah I do it that way, too.

Thermopyle posted:

I prefer the comment before the if block.

However, this does not make any sense and I do not know why I prefer it.

I mean, you don't put a docstring before the function or class definition starts.

This bothers me because every time I write such a comment, it's weighing on me to choose between what I prefer and what seems technically correct.

This is how i write docstrings:

Python code:

In [1]: # this is some function
   ...: # with a cool description
   ...: def test():
   ...:     pass
   ...:     

In [2]: help(test)
Help on function test in module __main__:

test()
    # this is some function
    # with a cool description

# ? May 10, 2019 09:03

QuarkJets: Sep 8, 2008

dougdrums posted:

I'd say before the if and never use elif.

Before the if because it's clear that the comment expresses something about the comparison as a whole, if that makes sense.

Never use elif because you should return or continue instead; or use a dict with (nested) functions if you want to switch on a value.

E: oh jeez I'm a page too slow.

I don't think that this is good advice; if/elif/else give you much more control and capability than a dict and I can think of numerous examples where I'd want to use these builtin Python keywords instead of hacking together some sort of dictionary implementation (which I assume would also be much slower on repeated calls?)

QuarkJets fucked around with this message at 09:08 on May 10, 2019

# ? May 10, 2019 09:06

dougdrums: Feb 25, 2005; CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)

Boris Galerkin posted:

I guess I was talking more about 1:1 math functions, like the one I stole from here:

Switch cases in other languages don't normally accept expressions. I wouldn't use lambdas or functions as a key, but you could, and it's a handy thing to do if you want to match multiple cases, or be stubbornly idiomatic (via next).

For the example given, I'm of the opinion that it is always better to write a separate function in the form:

Python code:


def f(x):
  if x < 2:
    return x*x
  if x == 2:
    return 6
  if x <= 6:
    return 10 - x
  raise ValueError('x is only defined in (-inf, 6]')
y = f(z)

Because:

You're forced to give it a name, possibly obviating the need for a comment anyways.
You can be certain of the state of the rest of the function, outside of the scope of your branches.
It's not possible to return from the outer function within your branches, which would complicate control flow.
It can be typed and the type can be validated, if that's your thing.

In compiled languages this would be inlined anyways, no idea if it happens in python.

QuarkJets posted:

I don't think that this is good advice; if/elif/else give you much more control and capability than a dict and I can think of numerous examples where I'd want to use these builtin Python keywords instead of hacking together some sort of dictionary implementation (which I assume would also be much slower on repeated calls?)

I don't do this with expressions as keys, because I don't want coworkers/contributors to kill me in my sleep, but I'd argue that you actually have more control for mostly the same reasons listed above. (That is, a generator comprehension inside of next() or with a filter clause.)

In the case where you're simply matching by equality (or equality of a type) like a more traditional switch construct, I definitely prefer it. Calling this simple case 'hacky' is naive. A lot of times when I'm writing a switch, I really just want a table of calls, possibly wrapped in another call. A dict with functions as values expresses my intent exactly.

If the performance of this is really an issue, you shouldn't be using python. If you must and are still concerned that it is creating a dict object each time (no idea if this is the case, but I would also assume so), you can define an instance of it immediately before the function definition and refer to that.
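For anyone who wants to check rather than assume, here is a sketch that compares the three shapes with `timeit`. In CPython a dict literal inside a function body is evaluated on every call, which is what hoisting it avoids. The function names and string return values are made up for the comparison, and no timings are claimed since they vary by machine and interpreter:

```python
import timeit

def with_elif(page):
    if page == "index":
        return "index_view"
    elif page == "about":
        return "about_view"
    elif page == "contact":
        return "contact_view"
    else:
        return "not_found_view"

def dict_per_call(page):
    # The dict literal is rebuilt every time the function runs.
    return {
        "index": "index_view",
        "about": "about_view",
        "contact": "contact_view",
    }.get(page, "not_found_view")

# Built once, at import time, right before the function that uses it.
_VIEWS = {"index": "index_view", "about": "about_view", "contact": "contact_view"}

def dict_hoisted(page):
    return _VIEWS.get(page, "not_found_view")

for fn in (with_elif, dict_per_call, dict_hoisted):
    seconds = timeit.timeit(lambda: fn("contact"), number=100_000)
    print(f"{fn.__name__}: {seconds:.4f}s")
```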

# ? May 10, 2019 10:58

Boris Galerkin: Dec 17, 2011; I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!

QuarkJets posted:

code:

In [1]: # this is some function
   ...: # with a cool description
   ...: def test():
   ...:     pass
   ...:     

In [2]: help(test)
Help on function test in module __main__:

test()
    # this is some function
    # with a cool description

TIL you can use syntax highlighting in code blocks... and also that comments don't get rendered :v:

# ? May 10, 2019 11:39

QuarkJets: Sep 8, 2008

dougdrums posted:

If the performance of this is really an issue, you shouldn't be using python. If you must and are still concerned that it is creating a dict object each time (no idea if this is the case, but I would also assume so), you can define an instance of it immediately before the function definition and refer to that.

Python is a commonly used language in the HPC domain for its ability to act as a glue language, but inevitably performance-impacting code will sometimes get written in Python. Don't assume that people who use python don't care at all about performance.

It's fine if you prefer using dictionaries as switch statements and lots of function calls, but it's not reasonable to call that combination a one-size-fits-all approach that's always superior to if/elif/else statements. There are cases where I'd rather have those kinds of blocks, and others where a function call will serve nicely instead

# ? May 10, 2019 11:39

dougdrums: Feb 25, 2005; CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)

QuarkJets posted:

Python is a commonly used language in the HPC domain for its ability to act as a glue language, but inevitably performance-impacting code will sometimes get written in Python. Don't assume that people who use python don't care at all about performance.

Sure, but this construct is probably not going to be the slowest part of your python program. If it comes to that, you should consider implementing it in a compiled language. (Or at least that part of your program.)

QuarkJets posted:

It's fine if you prefer using dictionaries as switch statements and lots of function calls, but it's not reasonable to call that combination a one-size-fits-all approach that's always superior to if/elif/else statements. There are cases where I'd rather have those kinds of blocks, and others where a function call will serve nicely instead

I'm not really gonna dog on someone for using else statements, because it's part of the language. My disdain for the else statement comes from implementing it in DSLs that already have functions, because it's redundant and always requires additional syntax to disambiguate. It's just my preference and I like to be consistent with it. Fwiw I haven't used an else statement in any language for like at least 5 years.

# ? May 10, 2019 12:19

Boris Galerkin: Dec 17, 2011; I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!

I had a longer response typed out but I'll just say this:

In an ideal world all code would be bug-free, tested, documented, and just work and I would have all the time in the world to do all the things that should be done. But that's not the world I live in, and sometimes I have to trade performance vs convenience by computing stuff directly in Python or Matlab on my laptop with limited RAM and a crippled mobile CPU, instead of spending time chasing better implementations for a thing that I'm only going to run a handful of times.

# ? May 10, 2019 13:36

dougdrums: Feb 25, 2005; CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)

Python excels at sketching things out, and getting poo poo done, but those things aren't necessarily mutually inclusive.

Whether or not you use else statements vs. a separate function is splitting hairs, especially if it's something you only use a few times. If it's something you only need to run a few times and you find yourself optimizing a python program in detail for it to finish in a reasonable timeframe, it should've been written in a different language from the start. You've wasted time having used python in the first place. Python won't meet the requirements, and it was a mistake to assume that it would.

On the topic of not using else statements, or writing switches as dicts, it doesn't really take extra time or effort to figure out or implement, so I'm a bit confused if that's what you're referring to.

If it's something you later intend to run on a cluster, hashing it out in python is good for a proof of concept. Ime if you have an application for HPC and the performance of python code is a pain point, trying to optimize it is a waste of time and money vs. porting it.

# ? May 10, 2019 14:40

NinpoEspiritoSanto: Oct 22, 2013

I've had a few cases that writing better code and running with pypy instead of cpython has solved whatever performance troubles I was having.

Incidentally pypy with 3.6 syntax/features is now considered beta and the unicode improvements haven't half made string processing fast as hell.

# ? May 10, 2019 15:11

dougdrums: Feb 25, 2005; CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)

Pypy is good poo poo. I've had people tell me that they didn't want to use pypy because they thought tracing was bad but I was like, ok, you're paying anyways.

# ? May 10, 2019 15:19

Thermopyle: Jul 1, 2003; ...the stupid are cocksure while the intelligent are full of doubt. -Bertrand Russell

QuarkJets posted:

This is how i write docstrings:

There's always someone who has to be difficult.

# ? May 10, 2019 16:14

punished milkman: Dec 5, 2018; would have won

Anyone have any package suggestions for extracting tables of data from image files (.png/.jpg) ? I tried using Tesseract/pytesseract and while it's doing a great job of detecting the text, the tabular aspect of it is totally lost and I couldn't find a straight forward path to processing tables with it. I've used Camelot with PDFs before, and it worked OK (at best), but I'm hoping to use something else this time around.

# ? May 10, 2019 16:19

Hollow Talk: Feb 2, 2014

Thermopyle posted:

There's always someone who has to be difficult.

If it's consistent, this is at least easily parsable via inspect.getcomments(), which means one can automatically rewrite it as proper docstrings. It's unnecessary, but at least fixable. :haw:
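A sketch of what that rewrite step could start from. `inspect.getcomments()` needs the source file on disk, so this writes a throwaway module to a temporary directory first. The module name `legacy_module` and the helper `comment_block_to_docstring` are hypothetical, and actually patching the docstring back into the file is left out:

```python
import importlib.util
import inspect
import tempfile
from pathlib import Path

SOURCE = '''\
# this is some function
# with a cool description
def test():
    pass
'''

def comment_block_to_docstring(comments: str) -> str:
    """Strip the leading '#' from each comment line and join the rest."""
    lines = [line.lstrip("#").strip() for line in comments.splitlines()]
    return "\n".join(lines)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "legacy_module.py"
    path.write_text(SOURCE)

    # Import the throwaway module from its file path.
    spec = importlib.util.spec_from_file_location("legacy_module", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # getcomments reads the source file, so call it while the file exists.
    comments = inspect.getcomments(module.test)
    docstring = comment_block_to_docstring(comments)

print(docstring)
```

This fallback is also why `help()` shows the comments in the IPython session quoted earlier: when a function has no docstring, pydoc uses the comment block directly above it.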

# ? May 10, 2019 16:32

a dingus: Mar 22, 2008; Rhetorical questions only; Fun Shoe

punished milkman posted:

Anyone have any package suggestions for extracting tables of data from image files (.png/.jpg) ? I tried using Tesseract/pytesseract and while it's doing a great job of detecting the text, the tabular aspect of it is totally lost and I couldn't find a straight forward path to processing tables with it. I've used Camelot with PDFs before, and it worked OK (at best), but I'm hoping to use something else this time around.

Can you extract the information into a tuple and create a table in something like pandas?

# ? May 10, 2019 17:54

EVIL Gibson: Mar 23, 2001; Internet of Things is just someone else's computer that people can't help attaching cameras and door locks to!; Switchblade Switcharoo

Thermopyle posted:

I prefer the comment before the if block.

However, this does not make any sense and I do not know why I prefer it.

I mean, you don't put a docstring before the function or class definition starts.

This bothers me because every time I write such a comment, it's weighing on me to choose between what I prefer and what seems technically correct.

thaaaaats python. putting shame into your coding style for no good reason.

# ? May 10, 2019 18:13

the yeti: Mar 29, 2008; memento disco

EVIL Gibson posted:

thaaaaats python. putting shame into your coding style for no good reason.

That and not having a switch, we hit all the high notes on one page :v:

# ? May 10, 2019 18:25

punished milkman: Dec 5, 2018; would have won

a dingus posted:

Can you extract the information into a tuple and create a table in something like pandas?

That is the plan, but some of the cells span multiple lines so the parsing becomes difficult.

This is the kind of thing I'm trying to make sense of with OCR:
https://images.app.goo.gl/aDrkvVibCzvGnbNy7

# ? May 10, 2019 18:48

Baronash: Feb 29, 2012; So what do you want to be called?

shrike82 posted:

There's an amazing Python ebook bundle on Humble bundle - https://www.humblebundle.com/books/python-oreilly-books

Fluent Python alone is worth the price of entry.

A little late, but thank you for linking this deal.

# ? May 10, 2019 18:50

susan b buffering: Nov 14, 2016

mr_package posted:

Thanks that is something I was thinking about too-- keep the schema as-is and then just try to provide a clean interface (if only to myself) to it.

One wrinkle is I'm basically migrating from a database to JSON (seems friendliest serialization format). But your approach would still work, it would just be reading it out of a dictionary (which is itself parsed JSON) instead of directly querying the db. I'm essentially doing SQL export to JSON, and trying to decide on what the schema/design of what that JSON should be.

I am not sure how much 'first normal form' style of data modeling to maintain in this case. The DBA in me says 'don't throw away this data / schema, it's useful and correct' but pragmatically I look at this and say 'actually YAGNI, just write it out as a config and forget the old database'.

I suppose the fundamental question is: if you're writing an app that is using a kind of medium-small data set (20k records or so?) but not using a database backend or even SQLite, would you still always always always model that data as relational? Or would you kind of cheat and make it a slightly bloated JSON config file and not worry about it too much? Is there a rule I just don't know about, like "No, it doesn't matter, you don't need a database so just use whatever format is most readable" or maybe "Yes for the love of god keep the data normalized it will save you so much pain in six months when you need to add another platform/target /os".

I probably would, but I also try and keep my models fairly agnostic of where the actual data is coming from. Common properties would come from a base class/mix-ins or sometimes stored in an object/dict passed into the instance.

When it comes to serialization, I'm definitely doing a bit of flattening. 2-column id:text tables are probably just going to be represented as the text. Related tables with more fields get an object (take care if your db structure allows for recursive dependencies :v:) and/or locator uri if this is being served as a REST API.

I personally try and avoid de-normalizing an already normalized database, especially as a shortcut to what my bespoke ORM should handle in code anyways. I totally believe you can make an educated decision to do so and be fine. For instance, with the lang table you mentioned in the first post, one could argue that since language tags are already standardized, keeping them in a separate table is excessive. OTOH that sort of table structure can be a real benefit if you need to track down / prevent typos or other invalid data.
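To make the flattening concrete, a minimal sketch of exporting rows with the id:text lookup resolved at write time. The `LANGUAGES` and `ITEMS` data and the field names are hypothetical, loosely based on the language_code example from the original question:

```python
import json

# Hypothetical rows as they might come out of the normalized tables.
LANGUAGES = {1: "en-US", 2: "fr-FR", 3: "de-DE"}
ITEMS = [
    {"id": 10, "title": "Hello", "language_id": 1},
    {"id": 11, "title": "Bonjour", "language_id": 2},
]

def flatten(item, languages):
    # Replace the foreign key with the text it points at. A KeyError here
    # means the row references a language that does not exist.
    flat = {k: v for k, v in item.items() if k != "language_id"}
    flat["language_code"] = languages[item["language_id"]]
    return flat

exported = json.dumps([flatten(item, LANGUAGES) for item in ITEMS], indent=2)
print(exported)
```

The lookup table still does its validation job during export, even though it never appears in the JSON.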

I've actually been doing basically the opposite of what you've been doing, which is writing a wrapper for a REST api I pull data from pretty regularly, which will probably end up in a sqlite database. I made heavy use of dataclasses, which I highly recommend looking into(namedtuples, too).

Here's my base model class and one of the child classes.

Python code:

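from dataclasses import dataclass
from typing import Optional

# Note: Listing (used as a return type below) is not shown in this post.
class Base(object):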
    def __init__(self, client: object, data: dict) -> None:
        super().__init__()
        self.client = client
        if data is not None:
            for attribute, value in data.items():
                setattr(self, attribute, value)

@dataclass(init=False)
class Persona(Base):
    id: Optional[int] = None
    name: Optional[str] = None
    bio: Optional[str] = None
    since: Optional[int] = None
    email: Optional[str] = None
    website: Optional[str] = None
    image: Optional[str] = None

    def playlists(self, **kwargs) -> Listing:
        params = kwargs
        params["persona_id"] = self.id
        return self.client.playlists(params=params)

The nice part about this is that if I decide to load this data into a database, I can use these models to read from it again so long as I have a client object with the same methods as the client wrapper I wrote for the api client.

susan b buffering fucked around with this message at 19:06 on May 10, 2019

# ? May 10, 2019 18:57

Thermopyle: Jul 1, 2003; ...the stupid are cocksure while the intelligent are full of doubt. -Bertrand Russell

punished milkman posted:

Anyone have any package suggestions for extracting tables of data from image files (.png/.jpg) ? I tried using Tesseract/pytesseract and while it's doing a great job of detecting the text, the tabular aspect of it is totally lost and I couldn't find a straight forward path to processing tables with it. I've used Camelot with PDFs before, and it worked OK (at best), but I'm hoping to use something else this time around.

I cannot remember the specifics, but I feel like Microsoft or Google have an Azure/Google Cloud API to do this.

# ? May 10, 2019 19:35

QuarkJets: Sep 8, 2008

dougdrums posted:

Sure, but this construct is probably not going to be the slowest part of your python program. If it comes to that, you should consider implementing it in a compiled language. (Or at least that part of your program.)

It doesn't matter whether it's the slowest part. The point is that you're trying to convince people to change convention in a way that sacrifices some performance for no tangible benefit.

Lots of people combine compiled code with Python, it's an extremely common usecase. The tradeoff between developer time and software performance is not always simple, and I would rather continue using a style that costs no additional time, is highly legible, and is naturally performant for the myriad cases where I don't intend to apply a profiler to a project.

# ? May 10, 2019 22:31

mr_package: Jun 13, 2000

skull mask mcgee posted:

I made heavy use of dataclasses, which I highly recommend looking into(namedtuples, too).

How are you organizing your @dataclass objects? If each instance is essentially a row from a database, is there a good way to order/organize them? e.g. if I just shove everything into a List I need to iterate through to try and find the specific instance I'm looking for. There's probably a smarter way?

code:

@dataclass
class Language(object):
    id: int
    code: str
    name: str

Would you remove the id parameter from the Language class and use it as a dictionary key instead? e.g. languages = {1: Language('en-US', 'English (US)'), 2: Language('fr-FR', 'French')} This feels a bit weird to me because I expect the id to be part of the Language object, but maybe this is part of data modeling: I should write a Languages class that is a collection of Language objects and implements some friendly methods to find them?

edit: also kind of looks like a lot of work to make @dataclass objects JSON serializeable but maybe that's just me being lazy. This feels like a lot of boilerplate just to make the json module happy. https://martin-thoma.com/make-json-serializable/ But I did also find jsonpickle which I'd never seen before which solves this problem http://jsonpickle.github.io/
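One possible shape for both questions, as a sketch rather than a recommendation: keep `id` on the object, index the objects in a small collection class, and lean on the standard library's `dataclasses.asdict()` for JSON. The `Languages` class and its method names are hypothetical:

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Language:
    id: int
    code: str
    name: str

class Languages:
    """Small collection that indexes Language objects by id and by code."""

    def __init__(self, items):
        items = list(items)
        self._by_id = {lang.id: lang for lang in items}
        self._by_code = {lang.code: lang for lang in items}

    def by_id(self, lang_id):
        return self._by_id[lang_id]

    def by_code(self, code):
        return self._by_code[code]

    def to_json(self):
        # asdict() turns each dataclass into a plain dict that json accepts.
        return json.dumps([asdict(lang) for lang in self._by_id.values()])

languages = Languages([
    Language(1, "en-US", "English (US)"),
    Language(2, "fr-FR", "French"),
])
print(languages.by_code("fr-FR").name)
print(languages.to_json())
```

Lookups become dictionary hits instead of list scans, and the id stays part of the object.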

mr_package fucked around with this message at 23:25 on May 10, 2019

# ? May 10, 2019 22:56

shrike82: Jun 11, 2005

punished milkman posted:

Anyone have any package suggestions for extracting tables of data from image files (.png/.jpg) ? I tried using Tesseract/pytesseract and while it's doing a great job of detecting the text, the tabular aspect of it is totally lost and I couldn't find a straight forward path to processing tables with it. I've used Camelot with PDFs before, and it worked OK (at best), but I'm hoping to use something else this time around.

Table detection isn't a solved problem even with current deep learning models.

# ? May 11, 2019 01:06

Extortionist: Aug 31, 2001; Leave the gun. Take the cannoli.

punished milkman posted:

Anyone have any package suggestions for extracting tables of data from image files (.png/.jpg) ? I tried using Tesseract/pytesseract and while it's doing a great job of detecting the text, the tabular aspect of it is totally lost and I couldn't find a straight forward path to processing tables with it. I've used Camelot with PDFs before, and it worked OK (at best), but I'm hoping to use something else this time around.

This isn't an easy problem. If the images are fairly consistent you can try using one of the tesseract outputs that supplies word coordinates and do your own table determination based on the relative positions of words. It might also be useful to run the images through opencv first to extract the positions of the lines (possibly also removing them from the image, or splitting into several small images prior to OCR).

You might look at AWS Textract (still in preview) or the Google/Azure OCR services, too, if someone's paying for it.
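A rough sketch of the do-it-yourself table determination described above, working only from word positions. The `WORDS` list is hypothetical; in practice the text, left, and top values would come from an OCR output that includes coordinates, such as pytesseract's `image_to_data`. Fixed-height bucketing is an assumption that only holds for evenly spaced, unrotated rows:

```python
from collections import defaultdict

# Hypothetical OCR output: (text, left, top) in pixels for each word.
WORDS = [
    ("Name", 10, 12), ("Qty", 120, 11), ("Price", 200, 13),
    ("Widget", 10, 42), ("3", 120, 44), ("9.99", 200, 41),
    ("Gadget", 10, 73), ("12", 120, 71), ("4.50", 200, 72),
]

def words_to_rows(words, row_height=20):
    """Bucket words into rows by vertical position, then sort left to right."""
    rows = defaultdict(list)
    for text, left, top in words:
        # Words whose tops fall in the same band are treated as one row.
        rows[top // row_height].append((left, text))
    return [
        [text for _, text in sorted(cells)]
        for _, cells in sorted(rows.items())
    ]

for row in words_to_rows(WORDS):
    print(row)
```

Cells that span multiple lines, or rows that straddle a band boundary, will break this, which is where detecting the ruling lines with OpenCV first would help.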

# ? May 11, 2019 02:57

punished milkman: Dec 5, 2018; would have won

Extortionist posted:

This isn't an easy problem. If the images are fairly consistent you can try using one of the tesseract outputs that supplies word coordinates and do your own table determination based on the relative positions of words. It might also be useful to run the images through opencv first to extract the positions of the lines (possibly also removing them from the image, or splitting into several small images prior to OCR).

You might look at AWS Textract (still in preview) or the Google/Azure OCR services, too, if someone's paying for it.

I think splitting the images up into sections with OpenCV and then extracting/parsing the text is what I'll need to do. This is way more involved than I thought... loving tables

# ? May 11, 2019 04:03

KICK BAMA KICK: Mar 2, 2009

Just curious, is there a long story short on why tables are hard?

# ? May 11, 2019 04:27

punished milkman: Dec 5, 2018; would have won

KICK BAMA KICK posted:

Just curious, is there a long story short on why tables are hard?

From my own experience it's because there are very few rules followed by tables beyond there being some semblance of aligned columns/rows of related data. Our brains are pretty good at contextualizing and making sense of what we see in a table, but there are a ton of potential subtle nuances that make a universal/generic computational solution very difficult.

# ? May 11, 2019 04:40

dougdrums: Feb 25, 2005; CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)

QuarkJets posted:

It doesn't matter whether it's the slowest part.

Hahaha, ok. It certainly does matter.

I already addressed the rest of what you said. Nobody I know personally has felt the need to express how my use of a dict for a switch is morally corrupt, and I find it useful so I'm gonna keep doing it. I also provided an alternative example with several concrete arguments for why I find it preferable to using else statements.

And there's the slightest chance I was being hyperbolic in my original post, for the sake of internet posting.

E: I mean like goddamn how did you read (and quote) what I posted and then decide to write that out. I had to go back and reread them just to make sure I actually wrote what I thought. Use else statements if you want, I'm gonna stop making GBS threads things up with my heresy.

dougdrums fucked around with this message at 05:50 on May 11, 2019

# ? May 11, 2019 05:35
