shrike82
Jun 11, 2005

HumbleBundle is offering a bunch of Python books (mostly newbie/intermediate level) on the cheap -
https://www.humblebundle.com/books/python-book-bundle

shrike82
Jun 11, 2005

At work, I typically deal with small datasets pulled through financial APIs (100K records at most) that I interact with using numpy/pandas/flat files.

I'm hobby tinkering around with IoT stuff on a Raspberry Pi, logging metered data (e.g., temperature, humidity, MACs connected to home wifi) as well as other stuff. I'm using ThingSpeak and it works well for simple logging.

What's a good solution for storing something that generates ~3K records every 5 minutes? I'd default to Python + SQLAlchemy + SQLite with the DB file backed up to Dropbox every day. Is there a nicer free(-ish) online solution?
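
For what it's worth, here's a minimal sketch of that SQLAlchemy/SQLite default (assuming SQLAlchemy 1.4+; the table and column names are made up for illustration):
code:
from datetime import datetime

from sqlalchemy import Column, DateTime, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Reading(Base):
    __tablename__ = "readings"
    id = Column(Integer, primary_key=True)
    sensor = Column(String)                # e.g. "temperature", "humidity"
    value = Column(Float)
    recorded_at = Column(DateTime, default=datetime.utcnow)

# SQLite keeps everything in a single file, which makes the
# daily copy-to-Dropbox backup straightforward.
engine = create_engine("sqlite:///sensor_log.db")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

with Session() as session:
    session.add(Reading(sensor="temperature", value=21.4))
    session.commit()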

shrike82
Jun 11, 2005

Which is a roundabout way of saying that Boris was right with his initial solution.

From Hettinger's Descriptor HowTo
code:
class Function(object):
    . . .
    def __get__(self, obj, objtype=None):
        "Simulate func_descr_get() in Objects/funcobject.c"
        return types.MethodType(self, obj, objtype)
I think everyone here, including me, just found out that attaching a function to an instance doesn't automatically bind it, which is an interesting quirk.
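
A quick sketch of the quirk, using a throwaway class (names are made up):
code:
class Widget:
    pass

def describe(self):
    return f"widget {id(self)}"

# Attached to the class, lookup goes through the descriptor protocol and binds it:
Widget.describe = describe
w = Widget()
print(w.describe())        # bound method, self filled in automatically

# Attached directly to an instance, no binding happens:
w2 = Widget()
w2.describe = describe
# w2.describe()            # TypeError: describe() missing 1 required positional argument: 'self'
print(w2.describe(w2))     # you have to pass the instance yourself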

shrike82
Jun 11, 2005

As always, 'Fluent Python' has something to say about this, with chapters on descriptors and metaprogramming.

Boris, I'd recommend you get it if you haven't, if you really want to delve into the language.

shrike82
Jun 11, 2005

Basically, and to your earlier question, when you use dot notation in Python (Class.attribute or Instance.attribute), you're not directly accessing the attribute.

You're calling the __setattr__ or __getattribute__ (and potentially __getattr__) magic methods and passing the attribute name to them. Attributes that implement the descriptor protocol (i.e. define __get__ and/or __set__) change how those methods resolve the attribute. Because functions implement __get__, there's some funkiness.

Your thought about using types.MethodType to manually bind the function is right.
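
For instance, a minimal sketch of binding by hand (class and function names are placeholders):
code:
import types

class Widget:
    pass

def describe(self):
    return f"widget {id(self)}"

w = Widget()
# Do explicitly what __get__ would do on a class-level lookup:
w.describe = types.MethodType(describe, w)
print(w.describe())     # callable without passing w yourself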

shrike82
Jun 11, 2005

Again,

from Hettinger's Descriptor HowTo
code:
class Function(object):
    . . .
    def __get__(self, obj, objtype=None):
        "Simulate func_descr_get() in Objects/funcobject.c"
        return types.MethodType(self, obj, objtype)
Or
https://twitter.com/raymondh/status/771628923100667904

Which is effectively the same

I'm very well aware of the issues with playing with metaprogramming, but just saying it's bad without providing a feasible alternative kinda sucks.

shrike82 fucked around with this message at 10:35 on Apr 20, 2017

shrike82
Jun 11, 2005

There's nothing odd about talking about dependency injection, outside of DI/IoC being slightly less 'popular' in Python-land due to dynamic typing making it trivial to handle.

shrike82
Jun 11, 2005

Epsilon Plus posted:

As a relative newbie, if dependency injection is less popular in Python than it is in other languages, how do you handle cases in Python that you'd typically use dependency injection for in, say, Java?

Sorry, I meant that it's so trivial in Python that people don't tend to think "Oh, I need to use a DI/IoC framework here".
More generally, and imho, people spend less time thinking about design patterns when building stuff in Python.

This is a good read if you want to learn more about Python design patterns
http://www.aleax.it/gdd_pydp.pdf
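
To make "trivial" concrete, here's a toy sketch (all names made up): the dependency is just a constructor argument with a sensible default, and duck typing stands in for interfaces or a container.
code:
class SmtpMailer:
    def send(self, to, body):
        print(f"SMTP -> {to}: {body}")

class FakeMailer:
    def __init__(self):
        self.sent = []
    def send(self, to, body):
        self.sent.append((to, body))

class SignupService:
    def __init__(self, mailer=None):
        # duck typing: anything with a .send() works, no container needed
        self.mailer = mailer or SmtpMailer()

    def register(self, email):
        self.mailer.send(email, "welcome!")

# production
SignupService().register("a@example.com")

# tests: inject a fake
fake = FakeMailer()
SignupService(mailer=fake).register("b@example.com")
assert fake.sent == [("b@example.com", "welcome!")]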

shrike82 fucked around with this message at 01:28 on Apr 21, 2017

shrike82
Jun 11, 2005

I've forgotten most of my Java, but doesn't it have a super keyword that's used in roughly the same way?
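
For reference, the Python side of that comparison looks something like this (minimal sketch):
code:
class Base:
    def __init__(self, name):
        self.name = name

class Child(Base):
    def __init__(self, name, extra):
        super().__init__(name)   # delegate up the MRO, no explicit parent class needed
        self.extra = extra

print(Child("x", 1).name)   # x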

shrike82
Jun 11, 2005

Actually, go to platforms like HackerRank and Codewars. They have algo problems as well as Python-specific problems, with a built-in interpreter that runs your program and checks its output against a bunch of test cases.

shrike82
Jun 11, 2005

Do most of you run Python in Windows?

Package issues like this and the difficulty of getting TensorFlow running on Windows drove me to a dual-boot Ubuntu setup.

It's a pain in the rear end to set up, but I've found switching to a Linux build helpful in mitigating stuff like this.

shrike82
Jun 11, 2005

Just use the Decimal class
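
A minimal illustration of why Decimal helps here - note that you construct them from strings, not floats, or you just capture the float error:
code:
from decimal import Decimal

print(0.1 + 0.2)                         # 0.30000000000000004
print(Decimal("0.1") + Decimal("0.2"))   # 0.3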

shrike82
Jun 11, 2005

yup, Google's TPUs use quantization to improve inference performance, so you can't say that ML practitioners can ignore FP issues

shrike82
Jun 11, 2005

An ML practitioner or academic could decide to ignore precision based on their needs, but I'd find it hard to argue that one can make an informed decision without understanding the basics of floating point arithmetic.

No argument about Malcolm acting like a douchebag, but hey, being a coder in the finance industry sucks, so it's not surprising that people like him behave that way.

shrike82
Jun 11, 2005

Out of curiosity, what's the NN problem you're working on?
Training for only a single epoch on a given training set (some kind of augmentation going on there?) and data generation taking 30% of training time smell like there's room to clean up that code before spending too much time on shifting to a multiprocess setup.

shrike82
Jun 11, 2005

Tried PyPy?

shrike82
Jun 11, 2005

Is there a rule of thumb for deciding whether to use the fork or forkserver method for spawning processes when using the multiprocessing library?
I'm noticing a 2x speed difference depending on the workload, and it's not consistently one method over the other.
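
For reference, a rough sketch of benchmarking the two start methods on a given workload (work() is just a stand-in for the real job):
code:
import multiprocessing as mp
import time

def work(n):
    return sum(i * i for i in range(n))

def time_start_method(method, jobs=8, n=2_000_000):
    ctx = mp.get_context(method)              # "fork" or "forkserver" (POSIX only)
    start = time.perf_counter()
    with ctx.Pool(processes=4) as pool:
        pool.map(work, [n] * jobs)
    return time.perf_counter() - start

if __name__ == "__main__":
    for method in ("fork", "forkserver"):
        print(method, round(time_start_method(method), 3))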

shrike82
Jun 11, 2005

For people getting to grips with pandas, the book "Python for Data Analysis" written by pandas creator Wes McKinney is a good primer.

shrike82
Jun 11, 2005

Project Euler focuses more on the math/stats side of things rather than teaching coding, so it really depends on what you're trying to teach them.

Codewars is more oriented towards the coding side of things.

shrike82
Jun 11, 2005

Any suggestions on scraping frameworks? I've always defaulted to requests + BS4, falling back to Selenium + Chrome if I need funky stuff.
But Scrapy looks quite appealing with its baked-in concurrency and plug-in support for stuff like proxies.
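
For context, that requests + BS4 default looks something like this (the URL and selector are placeholders):
code:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Pull every link on the page; swap in whatever selector you actually need
for link in soup.select("a[href]"):
    print(link["href"])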

shrike82
Jun 11, 2005

As an aside, I've gotten in the habit of using Docker to isolate environments in lieu of something like pipenv.

Anyone have thoughts on best practices?

shrike82
Jun 11, 2005

punished milkman posted:

PyCharm (pro edition) has support for remote debugging with Docker.

I use that, as well as Jupyter notebooks connected to a server running within the container. Honestly, the latter works for me most of the time because of the nature of my work - ML model building.

Both are finicky, so I'm hoping someone has a better workflow.

shrike82
Jun 11, 2005

Docker is compelling if your application has non-Python dependencies and your team uses heterogeneous, potentially shared, development machines.

My particular use case requires GPU access and we have a hierarchy of dev machines - from MacBook Pros/Linux gaming laptops to in-house multi-GPU linux servers to AWS GPU instances. Docker papers over the various hardware setups. It also allows for pretty fine-grained isolation, down to exposing individual GPUs to a given instance, so this helps a lot with maximizing usage of the in-house servers.

shrike82
Jun 11, 2005

Dominoes posted:

I'm trying to repro a bug using docker and it's been a true PITA. Docker's rubbing me the wrong way, STS. Multiple unauthorized reboots, BIOS setting changes, ghost servers running long-deleted code etc. Don't care it's popular - this isn't polite behavior. Don't like.

That's pretty unusual...

shrike82
Jun 11, 2005

Tesseract's models for vision are quite dated. You're better off feeding the images into a Google, Microsoft, etc. online CV API.

shrike82
Jun 11, 2005

I'd suggest trying out samples of your handwriting on Tesseract and an online API service.

It's really a night-and-day difference, and this specific domain is an area where deep learning techniques have shown themselves to perform better without any caveats. There's a reason why digit recognition (MNIST) is treated as the "hello world" equivalent for AI.

shrike82
Jun 11, 2005

There's an amazing Python ebook bundle on Humble Bundle - https://www.humblebundle.com/books/python-oreilly-books

Fluent Python alone is worth the price of entry.

shrike82
Jun 11, 2005

Most Python guides are for beginners; Fluent Python is targeted at intermediate and even advanced developers, examining the language in depth. It does a good job of discussing development best practices, explaining language plumbing, and providing practical code walk-throughs of major use cases/packages.

shrike82
Jun 11, 2005

If you log in to your accounts page, you should see a bundle page with a bunch of links to download the ebooks in various formats. Enjoy!

shrike82
Jun 11, 2005

As an aside, do you guys learn mainly from written stuff (books, articles, etc.)?
I was just musing with some colleagues that younger developers tend to be more comfortable watching videos to learn technical stuff. Some of what I've seen them watch are deep dives, not just 101 stuff. The flip side is that they don't touch (O'Reilly or whatever) books at all - sticking to online stuff.

shrike82
Jun 11, 2005

QuarkJets posted:

Tell me more about these antics

:same:

shrike82
Jun 11, 2005

The guy who "outed" him was fairly thoughtful in explaining why a public blog post was the last step after a bunch of other things he had tried in order to resolve the situation.

In any case, Reitz scrubbing all details about the fundraiser from his website seems pretty dirty.

shrike82
Jun 11, 2005

punished milkman posted:

Anyone have any package suggestions for extracting tables of data from image files (.png/.jpg) ? I tried using Tesseract/pytesseract and while it's doing a great job of detecting the text, the tabular aspect of it is totally lost and I couldn't find a straight forward path to processing tables with it. I've used Camelot with PDFs before, and it worked OK (at best), but I'm hoping to use something else this time around.

Table detection isn't a solved problem even with current deep learning models.

shrike82
Jun 11, 2005

Err, is there a reason why you're building TF manually? There are official TF images on Docker Hub for almost every combination of Python and CPU/GPU you'd be interested in.

shrike82
Jun 11, 2005

General_Failure posted:

To build without AVX opcodes, and to support CUDA3.0

I'm no help there but out of curiosity, what's your use-case? Some kind of embedded CV application?

shrike82
Jun 11, 2005

My bad... Why not give Google Colab a try? It's free and would let you play with CV frameworks.

shrike82
Jun 11, 2005

You should be able to export the token data in a spaCy Doc directly to numpy arrays and then use the usual numpy serialization methods.
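
Something along these lines should work (a sketch, assuming en_core_web_sm is installed; the attribute choice is just an example):
code:
import numpy as np
import spacy
from spacy.attrs import ORTH, POS, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup.")

# One row per token, one column per attribute, stored as uint64 IDs
arr = doc.to_array([ORTH, POS, ENT_TYPE])
np.save("doc_tokens.npy", arr)

restored = np.load("doc_tokens.npy")   # plain numpy from here on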

shrike82
Jun 11, 2005

Business posted:

Thanks for the responses all, they are helpful. Going to look into storing as numpy arrays and/or being more selective about what I store. The main issue is that my texts are large enough that I have to stream them as a chopped-up list of Docs, and then find a way to store that and the associated vocab data without resorting to writing it to bytes. The blobs would give me everything, and they're good for my own offline purposes, but it's helpful to have it confirmed that it's weird to store a big binary data thing for each file.

Preprocessing texts in spaCy shouldn't be that compute-intensive, so consider just storing the raw texts and computing on demand, especially if users are given the option to upload new texts.

Trade CPU time for storage - even with numpy, I suspect you're going to be dealing with large objects, since there's a lot of token-level metadata.
