9-Volt Assault
Jan 27, 2007

Better two tits in the hand than ten on the run.
As another data guy, I can't really remember the last time I ran a regular Python script. Notebooks are good and cool, and I use Papermill all the time to run them like a script.

Even cooler is Julia's Pluto library, which is notebooks combined with Excel-like reactivity.


Dominoes
Sep 20, 2007

I assumed notebooks were for sharing results, especially with interactive visuals.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Dominoes posted:

I assumed notebooks were for sharing results, especially with interactive visuals.

When I share results I output the notebook to PDF. The few times I’ve needed to do this it’s always with nontechnical people.

a foolish pianist
May 6, 2007

(bi)cyclic mutation

I use notebooks for research and repeatable tinkering, then just rewrite anything that's going into production in more traditional python format.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

a foolish pianist posted:

I use notebooks for research and repeatable tinkering, then just rewrite anything that's going into production in more traditional python format.

Same for web scrapers in selenium. Notebooks make it so much faster to iterate through the various steps.

OnceIWasAnOstrich
Jul 22, 2006

CarForumPoster posted:

When I share results I output the notebook to PDF. The few times I’ve needed to do this it’s always with nontechnical people.

Yeah the promise of notebooks being useful to share with non-technical people was a little oversold for me. Anyone non-technical isn't going to 1) install Python and associated libraries 2) be able to run one of my notebooks 3) care about the code. I can make a PDF but it doesn't really get me much when with just a little more work I can make a Shiny/Dash app and get actual interaction. I know there are services to kinda-sorta host them but :effort:.

I've experimented a bit with using them to quickly prototype a bit of an overall analysis pipeline and then integrate it into a bigger Snakemake pipeline and that worked well enough. Nothing groundbreaking but it probably did save me some time and if I need to dig in more I can pretty easily.

I use them extensively for loving around and experimenting, since having visualizations or big chunks of data pop out semi-formatted in a browser cell is (usually) more useful than having to write extra code for it in a REPL via SSH, or having to use a separate program or X server to let matplotlib poo poo appear over SSH. These days this usually goes hand-in-hand with using them to test out changes to an actual library that I'm writing in a real IDE. Problems with multiprocessing or async issues do pop up in certain contexts that cause me to have to go entirely out of the notebook environment, though.

I've also found myself extensively using Colab for teaching students since neither they nor I have to set anything up and they can't possibly break anything that can't be fixed by hitting "Factory Reset Runtime". If I can get them to not overwrite the code cells that worked it helps a little bit early on because previous work doesn't "disappear" in their brains the way it seems to on the CLI or REPL.

Hed
Mar 31, 2004

Fun Shoe
Do you have any recommendations for alternatives to scrapy?

The problem I'm trying to solve: I built out one scrapy workflow that goes on a task queue, and given the Twisted reactor and the threading/singleton stuff I've run into, it's just obvious I'm really trying to "get around" a very opinionated framework. If this means I need to learn it better I'm willing to, but at this point I'm ready to consider my current implementation a throwaway prototype if there's something else.

My requirements would be just basic HTTP scraping, selection on XPaths, populating/POSTing form data, with the ability to fire up Selenium if I need to. I am typically saving entire documents instead of snippets.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Hed posted:

Do you have any recommendations for alternatives to scrapy?

The problem I'm trying to solve: I built out one scrapy workflow that goes on a task queue, and given the Twisted reactor and the threading/singleton stuff I've run into, it's just obvious I'm really trying to "get around" a very opinionated framework. If this means I need to learn it better I'm willing to, but at this point I'm ready to consider my current implementation a throwaway prototype if there's something else.

My requirements would be just basic HTTP scraping, selection on XPaths, populating/POSTing form data, with the ability to fire up Selenium if I need to. I am typically saving entire documents instead of snippets.
I've never used scrapy, but I've written Python web scrapers for a couple dozen sites, mostly behind logins and often with multiple steps.
I'm partway through converting our scrapers, which are all Python + Selenium running on one machine, to run entirely on AWS Lambda using headless Chrome driven by an AWS Step Function. Haven't finished yet, but it's looking like it'll be highly flexible and trivially scheduled; AWS SAM makes deployments easy, and it lets us keep some pretty complex business logic alongside the scrapers.

It also costs almost nothing, because it's AWS Lambda and these things combined only use about 2 hours of compute time per day. We're doing this ahead of building out a large number of scrapers and lead-enrichment functions that need to fail in easily tracked ways and be able to stop enriching a lead based on various logic.
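A rough sketch of the handler shape this ends up with. Every name here (run_scraper, the EFS path, the event fields) is invented for illustration, not taken from an actual codebase:

```python
# Hypothetical Lambda entry point for a scraper step; a Step Function state
# would invoke this with an event describing the target.
import json

CHROME_PATH = "/mnt/efs/bin/headless-chromium"  # assumption: binary lives on EFS


def run_scraper(target_url: str) -> dict:
    # Real code would start headless Chrome from CHROME_PATH via Selenium here.
    return {"url": target_url, "records": []}


def lambda_handler(event, context):
    """Scrape the URL named in the event and return a JSON-serializable result."""
    target = event.get("url", "https://example.com")
    result = run_scraper(target)
    # Returning a structured dict lets the Step Function route on success/failure.
    return {"statusCode": 200, "body": json.dumps(result)}
```

The wrapper-around-the-scraper split is the point: the scraping code stays mostly as written, and only the thin handler knows about Lambda.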

Let me know if I can help, happy to share what I've learned... it has been a steep learning curve trying to make these work reliably.

unpacked robinhood
Feb 18, 2013

by Fluffdaddy
Can you run headless Chrome inside a Lambda?
I wrote a Selenium/Chrome container thing recently and I wouldn't know how to convert it to a Lambda.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

unpacked robinhood posted:

Can you run headless Chrome inside a Lambda?
I wrote a Selenium/Chrome container thing recently and I wouldn't know how to convert it to a Lambda.

I can say for sure you can, and that SAM (the lambci Docker container) emulates it decently well, so you can test locally before deploying.

There's no conversion needed; really you just need a self-contained binary. That said, it gets to the 250MB limit quickly, but I got around this by adding EFS to my Lambda.

EDIT: I should caveat that by no "conversion" needed I mean that you can use your code mostly as written. Obvs you'll need to point to the Chrome executable and chromedriver; I put mine on EFS. You'll probably need to write some wrapper for the actual scraper code that is called by the lambda_handler, though. Also logging will need some sort of treatment. The easy answer, if you append logs to a local file now, is to simply log to a file you append to in EFS. We haven't made a final decision on CloudWatch versus this log style yet.

CarForumPoster fucked around with this message at 03:03 on Dec 22, 2020

Macichne Leainig
Jul 26, 2012

by VG
A bit late to notebook chat, but (as a probable Luddite) I exclusively use regular Python scripts. I love the argparse module. Takes a bit to get the scripts into a good spot, but once you do it's real nice to just be like python script.py do-thing resources=/var/path/to/resources whenever you need to kick off something.
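A minimal sketch of that kind of script, with invented subcommand and option names. (Note argparse spells options as `--resources`, not `resources=`.)

```python
# Skeleton of an argparse CLI with a subcommand, as described above.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="script.py")
    sub = parser.add_subparsers(dest="command", required=True)
    # One subcommand per thing you might kick off.
    do_thing = sub.add_parser("do-thing", help="kick off the thing")
    do_thing.add_argument("--resources", default="/var/path/to/resources")
    return parser


# Invoked as: python script.py do-thing --resources /tmp/res
args = build_parser().parse_args(["do-thing", "--resources", "/tmp/res"])
print(args.command, args.resources)
```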

mystes
May 31, 2006

Hed posted:

Do you have any recommendations for alternatives to scrapy?

The problem I'm trying to solve: I built out one scrapy workflow that goes on a task queue, and given the Twisted reactor and the threading/singleton stuff I've run into, it's just obvious I'm really trying to "get around" a very opinionated framework. If this means I need to learn it better I'm willing to, but at this point I'm ready to consider my current implementation a throwaway prototype if there's something else.

My requirements would be just basic HTTP scraping, selection on XPaths, populating/POSTing form data, with the ability to fire up Selenium if I need to. I am typically saving entire documents instead of snippets.
Unless you want to recursively crawl sites, you generally don't need a dedicated scraping library like scrapy. If you just have a list of pages you want to get content from, it's probably easier to do it manually.

If you want to use simple html parsing for some pages and selenium for other pages that may make things a little more complicated, though.

Dominoes
Sep 20, 2007

Re the notebook chat: Notebooks are something I've kept in my hip pocket as a tool I know is available when the use case comes up. I was surprised to hear (partly from articles critical of their non-linear execution) how they're often used in practice, i.e. for broader use cases.

mystes
May 31, 2006

Dominoes posted:

Re the notebook chat: Notebooks are something I've kept in my hip pocket as a tool I know is available when the use case comes up. I was surprised to hear (partly from articles critical of their non-linear execution) how they're often used in practice, i.e. for broader use cases.
The "hidden state" issue isn't something you're normally going to run into as long as you're at all careful, and notebooks are very good for some applications. Actually, the web scraping someone was just asking about is a good example: you want to keep the state around while you're working on it, it's much easier to poke around with web pages interactively, and it's a lot easier to organize that in a notebook than in a REPL.

There are lots of situations where notebooks aren't as good (basically anything that you could easily write in an IDE without access to runtime information).

NinpoEspiritoSanto
Oct 22, 2013




mystes posted:

The "hidden state" issue isn't something you're normally going to run into as long as you're at all careful, and notebooks are very good for some applications. Actually, the web scraping someone was just asking about is a good example: you want to keep the state around while you're working on it, it's much easier to poke around with web pages interactively, and it's a lot easier to organize that in a notebook than in a REPL.

There are lots of situations where notebooks aren't as good (basically anything that you could easily write in an IDE without access to runtime information).

Finally I have something to try notebooks for

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Bundy posted:

Finally I have something to try notebooks for

I need to do a one-off scraping of a website every month or so, and Jupyter is by far the best for this. Here's an example workflow.

Requirement:
For purposes of recruiting an attorney barred in states A and B who has experience in labor and employment AND/OR business litigation, get a list of all attorneys that State Bar B says are barred in states A and B.

How I use Jupyter to make this very fast:
Fire up Selenium, manually navigate to the bar's search page, search manually, then have a quick-n-dirty Selenium script grab the results from each page and put them in a DataFrame I dump to Excel.

I was able to get a list of probably every possible candidate in the nation that fulfilled the requirements, including very detailed other info like practice areas of law, what year they graduated law school, phone numbers, emails, where they're working now, etc. This is way better than any head hunter or hitting up LinkedIn could ever do, and it was about a two-hour job: 40 potential candidates for some exceptionally rare job requirements.
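The "grab the results from each page" step, sketched with the standard library only (the real thing would feed Selenium's page_source in and dump to pandas/Excel; the table markup and cell contents here are invented):

```python
# Stdlib stand-in for scraping a results table into row lists, as described
# above. In practice page_source would come from a Selenium driver.
from html.parser import HTMLParser


class ResultTableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())


page_source = "<table><tr><td>Jane Doe</td><td>Bar A</td></tr></table>"
parser = ResultTableParser()
parser.feed(page_source)
print(parser.rows)  # [['Jane Doe', 'Bar A']]
```

From there, `pandas.DataFrame(parser.rows)` and `to_excel` are the one-liners the post alludes to.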

Sad Panda
Sep 22, 2004

I'm a Sad Panda.
I've only seen Jupyter being used when I did the MIT Intro to CS & Programming course. However, I have spent many days writing Python to scrape using Selenium. Are you saying that instead of having to re-run the script every time I wanted to try out a new element, I could have just used Jupyter instead? I used PyCharm's debug mode to write, but Jupyter seems like it would have been a lot simpler.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Sad Panda posted:

I've only seen Jupyter being used when I did the MIT Intro to CS & Programming course. However, I have spent many days writing Python to scrape using Selenium. Are you saying that instead of having to re-run the script every time I wanted to try out a new element, I could have just used Jupyter instead? I used PyCharm's debug mode to write, but Jupyter seems like it would have been a lot simpler.

Yep!

Bad Munki
Nov 4, 2008

We're all mad here.


CarForumPoster posted:

a one-off...every month
:colbert:

CarForumPoster
Jun 26, 2013

⚡POWER⚡

It's a different one-off every month.

Some examples have been:
-All PACER court records for a particular plaintiffs attorney
-All State A records of attorneys barred in states A and B
-All State B records of attorneys barred in states A, B and C
-All ______ county records where the defendant is Liberty Mutual Insurance
etc.

These come up about once per month, and I can bang out something good enough to be useful (but not ready for production) in about 2 hours thanks to Jupyter. I am actually interviewing two people I found this way who, if they join, could represent a 25% increase in revenue over the next year or so. The business impact of this system is pretty huge.

hhhmmm
Jan 1, 2006
...?

Sad Panda posted:

I've only seen Jupyter being used when I did the MIT Intro to CS & Programming course. However, I have spent many days writing Python to scrape using Selenium. Are you saying that instead of having to re-run the script every time I wanted to try out a new element, I could have just used Jupyter instead? I used PyCharm's debug mode to write, but Jupyter seems like it would have been a lot simpler.

You can use "scientific mode" in PyCharm and get the best of both worlds. Basically you create cells using #%%, then execute code stepwise, but you still get all of the support that PyCharm provides, with all variables available for inspection in the variable viewer or whatever it's called.
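For anyone who hasn't seen it, such a script is just a plain .py file with #%% comments as cell delimiters (cell names here are invented):

```python
# A PyCharm scientific-mode script: each #%% block runs as its own cell,
# and variables persist between cells like in a notebook.
#%% load
data = [1, 2, 3, 4]

#%% transform
squared = [x * x for x in data]

#%% inspect
print(squared)  # [1, 4, 9, 16]
```

Because it's an ordinary script, the same file also runs top to bottom with plain `python`, which notebooks can't claim.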

Butter Activities
May 4, 2018

For me notebooks were the thing that got me through the slog of getting to the point where I can write simple actual programs. The immediate reactivity and ability to see what variables are doing step by step without having to use a debugging program was amazing.

Even though I do most stuff in nano or Sublime Text now, when I'm dealing with a Python concept I'm not familiar with, I go to notebooks to gently caress around with it first.

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
So I'm at the point where I'm finally putting together a real Python project and I'm trying to understand package/module structure better. I'm working on an ML project where I want to test different pipelines by stringing together different combinations of data feeders, models, and trackers. No need to get into too much detail about this, but the way I've gone about structuring my package is as follows:

code:
MLP (top-level package folder)
---> models
---> trackers
     ---> __init__.py
     ---> trackerA.py
     ---> trackerB.py 
Assume that within trackerA.py we have BobsTracker and in trackerB.py we have JoesTracker. Then, when coding, I'd have to use, e.g.,

code:
from MLP.trackers.trackerA import BobsTracker
from MLP.trackers.trackerB import JoesTracker
However it seems kind of unnecessary/redundant to have to include the two "parts" of the namespace path. Like, ideally, I'd just want to do this:

code:
from MLP.trackers import BobsTracker 
from MLP.trackers import JoesTracker
But I'm not sure how to do that without requiring that somehow everyone put their trackers into the same file. Maybe some __init__.py trickery can be helpful here?

SurgicalOntologist
Jun 17, 2004

Cyril Sneer posted:

But I'm not sure how to do that without requiring that somehow everyone put their trackers into the same file. Maybe some __init__.py trickery can be helpful here?

Yeah, that's basically it. I think there are different approaches, but what I do is ask, for every __init__.py, "is there anything in this package [folder with an __init__.py] that I want to be able to import from the level above?" and if the answer is yes, do this (using your example):

Python code:
# MLP/trackers/__init__.py
from MLP.trackers.trackerA import BobsTracker
from MLP.trackers.trackerB import JoesTracker
Then anywhere else you can do

Python code:
from MLP.trackers import BobsTracker, JoesTracker
Edit: Essentially, MLP/trackers/__init__.py corresponds to MLP.trackers in the same way that MLP/trackers/trackerA.py corresponds to MLP.trackers.trackerA. Anything you want to be accessible in MLP.trackers, you must define or import in MLP/trackers/__init__.py.

SurgicalOntologist fucked around with this message at 19:52 on Dec 26, 2020

CarForumPoster
Jun 26, 2013

⚡POWER⚡

SurgicalOntologist posted:

Python code:
# MLP/trackers/__init__.py
from MLP.trackers.trackerA import BobsTracker
from MLP.trackers.trackerB import JoesTracker
Then anywhere else you can do

Python code:
from MLP.trackers import BobsTracker, JoesTracker

Also what I do.

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
Great, thanks, that was helpful!

Zoracle Zed
Jul 10, 2001
just a minor suggestion that it's also cool & good to avoid the import convenience shuffling and stick to slamming alt-enter in PyCharm as long as possible. honestly even then i'd still prefer something siloed off like "from butts.quickstart import Butt, Fart, poo poo"

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
As a follow up, is there a way to allow dynamic assignment of classes? In order to assemble my pipeline, I was thinking of using a configuration dictionary, with the keys specifying the particular model to use, amongst other things.

I.e., something like:

code:
config_dict = { 'predictor': modelA }
I suppose one way would be to use a string value...

code:
config_dict = { 'predictor': 'modelA' }
then in code, I do something like

code:
if config_dict['predictor'] == 'modelA':
    # hard coded loading of model A
But the need to do a kind of string lookup seems cheap somehow, and I'm trying to think how to maintain a linkage between a class and its lookup name (I might not be wording this very clearly). Anyway, this seems like a solved problem, so some guidance would be helpful!

necrotic
Aug 2, 2005
I owe my brother big time for this!
Look at the __import__ function.
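For reference, the modern spelling of that idea is importlib.import_module, which the standard library docs recommend over calling __import__ directly:

```python
# importlib.import_module is the documented front door for dynamic imports;
# __import__ works but the docs steer you toward this instead.
import importlib

module_name = "json"  # in practice this could come from a config file
mod = importlib.import_module(module_name)
print(mod.dumps({"ok": True}))  # {"ok": true}
```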

Phobeste
Apr 9, 2006

never, like, count out Touchdown Tom, man

Cyril Sneer posted:

As a follow up, is there a way to allow dynamic assignment of classes? In order to assemble my pipeline, I was thinking of using a configuration dictionary, with the keys specifying the particular model to use, amongst other things.

I.e., something like:

code:
config_dict = { 'predictor': modelA }

You can literally do this: classes are objects too, and defining one with the class keyword creates a name in the enclosing namespace bound to that class object. You can call them to call their constructors, too

code:

class A:
   def __init__(self): print("im an A")

config_dict = {'A': A}

config_dict['A']()  # prints "im an A"
Not sure if that's what you're looking for, but it is a thing.

OnceIWasAnOstrich
Jul 22, 2006

You can also getattr the class name out of the module that the class is defined in. If you need to pull specific classes based on something like a text config file this works well.

Assuming all of your models are present in the models module:

Python code:
from mypackage import models
predictor = 'modelA'
model_cls = getattr(models, predictor)
model = model_cls()

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

Phobeste posted:

You can literally do this, classes are objects too and defining them with the class keyword creates an entry for that class object. You can call them to call their constructors, too


Thanks, yes, I know this and probably shouldn't have used that example. I don't want the class object itself in the dictionary, but rather a descriptive entry for ease of use. So the user would just specify something like "vgg" or "resnet" and that would map to the appropriate class.
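One common way to keep that name-to-class linkage in one place is a registry dict filled by a decorator, so the descriptive string lives right next to the class it names. (The class names here are invented stand-ins, not real model implementations.)

```python
# A class registry: the decorator records each class under its config name,
# so strings like "vgg" in a config stay in sync with the code.
MODEL_REGISTRY = {}


def register(name):
    def deco(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return deco


@register("vgg")
class VggModel:  # hypothetical stand-in for a real model class
    pass


@register("resnet")
class ResNetModel:
    pass


config_dict = {"predictor": "resnet"}
model = MODEL_REGISTRY[config_dict["predictor"]]()
print(type(model).__name__)  # ResNetModel
```

An unknown string then fails loudly as a KeyError at lookup time instead of silently falling through an if/elif chain.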

Dominoes
Sep 20, 2007

Cyril Sneer posted:

As a follow up, is there a way to allow dynamic assignment of classes? In order to assemble my pipeline, I was thinking of using a configuration dictionary, with the keys specifying the particular model to use, amongst other things.
...

I recommend dataclasses for configs, since they likely have a fixed set of settings (fields), and the values are of different types. Comparatively, dictionaries leave you open to errors that will surprise you at runtime.

Use an Enum to define allowed classes that can be in the config:
Python code:
from dataclasses import dataclass
from enum import Enum, auto


class SelectedModel(Enum):
    MODEL_A = auto()
    MODEL_B = auto()
    # ...


@dataclass
class Config:
    predictor: SelectedModel


config = Config(SelectedModel.MODEL_A)

Dominoes fucked around with this message at 21:16 on Dec 28, 2020

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
Interesting, thanks. I'm a bit confused by your example code though. Is MODEL_A an actual class definition (in which case how can that assignment work?), or something else?

Dominoes
Sep 20, 2007

SelectedModel.MODEL_A is an enum variant. You can think of an enum as a way to list choices. (Or as a binary with more than 2 variants, and semantic meaning to each variant.)

When processing your config, you run different code depending on which variant the predictor field holds. This way your IDE etc. will only allow certain classes to be selected in the config; enums let you specify only valid models. Your example used strings, and presumably not every string (or every class) is a valid option in your config!

If you plan to serialize your config, instead of using enum.auto(), specify an integer for each variant, so your serialization is consistent.

Dominoes fucked around with this message at 22:00 on Dec 28, 2020

Da Mott Man
Aug 3, 2012



I do something like this with a Discord bot I wrote, where I abuse importlib to hot-load/unload plugins without restarting the bot.

Commands check the plugin list for an attribute that matches the command list in each plugin and fire off coroutines in parallel.
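The hot-load part of that can be sketched with just the standard library (module name and contents here are invented; a real bot would watch a plugins directory):

```python
# importlib.reload re-executes a module's source in place, so code holding a
# reference to the module sees the new attributes without a process restart.
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # avoid a stale .pyc confusing the quick reload

plugin_dir = Path(tempfile.mkdtemp())
sys.path.insert(0, str(plugin_dir))

(plugin_dir / "myplugin.py").write_text("VERSION = 1\n")
plugin = importlib.import_module("myplugin")
print(plugin.VERSION)  # 1

# "Edit" the plugin on disk, then hot-reload it.
(plugin_dir / "myplugin.py").write_text("VERSION = 999\n")
plugin = importlib.reload(plugin)
print(plugin.VERSION)  # 999
```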

wolrah
May 8, 2006
what?
I understand the use of virtualenvs and such for keeping different projects' dependencies isolated from the system and each other, but if I actually do want the latest version of something installed system-wide, is sudo pip install actually the best way to go?

My specific use case right now is youtube-dl, but there are a few other utilities written in Python that use pip as their official package manager and that I'd generally want installed system-wide, with no need to activate environments before use.

Dominoes
Sep 20, 2007

On Windows, pip install is fine. On Linux, you risk putting your system in a totalled state, because the OS relies on the Python install it comes with.

Try this:
-Download the latest Python source from python.org
-Build and install it with `./configure`, `make`, and `sudo make install` from the directory you unpacked it to
-Use its pip: `python3.9 -m pip install apackage`.

This way, you don't risk modifying a package your OS relies on. Or, roll the dice with your system python's pip. It will probably be fine.

Dominoes fucked around with this message at 19:39 on Dec 30, 2020


Bad Munki
Nov 4, 2008

We're all mad here.


Similarly on OSX messing with system Python will wreck your poo poo, just absolutely don’t.
