|
So uh in my youth I played World of Warcraft and did not study in university, although I was enrolled there for quite a few years. Later I studied to be an X-ray technician. I graduated in 2016 and now I am very loving bored indeed. I have tried to get work closer to home, but no: my commute is still 95 kilometers one way (a long trip), even after four years. There simply are no jobs closer to home (at least for me). Maybe coding wouldn't be so bad after all instead of this healthcare poo poo? Lots of junior openings nearby where I live. Tons. Massive amounts of all kinds of programming job openings. Maybe I could be a Data Engineer or something instead of fullstack?! I know some basic SQL, but they do something with Python so I need to learn Python. I have not done any programming in 10 years. I looked at options like freeCodeCamp and did the first 2 exercises yesterday and today: https://repl.it/@Ihmemies/MeagerNeatPerl https://repl.it/@Ihmemies/fcc-time-calculator The unit tests pass so they "work" -> send it. Maybe I'll do those FCC Python exercises because I don't know what else to do. How to learn Python? What would be important for a Data Engineer, where should I concentrate? I need to be able to produce decent quality code to impress some dudes so they would hire me. Then there is this: https://www.datacamp.com/tracks/data-engineer-with-python And this: https://www.codewithrepl.it/python-projects-for-beginners.html Maybe this: https://www.udemy.com/course/complete-python-bootcamp/
|
# ? Aug 26, 2020 12:46 |
|
|
Ihmemies posted:How to learn python? What would be important for a Data Engineer, where to concentrate? I need to be able to produce decent quality code to impress some dudes so they would hire me. If you're learning to get a job, find a job req for a position that looks good and go down the list studying all of the needed skills/technologies. Build a project related to that job. Python may be great for a data engineer position. IMO the best way to learn is to start with a personal project you care about and struggle through it and the tutorials needed to complete it. Ask questions here along the way. Starting with a problem in a domain you're knowledgeable about, like one of your hobbies or your current job, is a good bet. Also download PyCharm; it will teach you most of the good habits you need to succeed by yelling at you with squiggly lines.
|
# ? Aug 26, 2020 15:04 |
|
CarForumPoster posted:TLDR: How can I figure out the minimum number of packages needed in a venv/my environment? I'm not sure I understand your question, but aren't you supposed to use layers for that stuff? Asking because at some point I made a script that takes a requirements.txt and spits out a layer.zip that you can push to AWS. e: work proxy won't let me pastebin but I can upload it later. unpacked robinhood fucked around with this message at 15:23 on Aug 26, 2020 |
# ? Aug 26, 2020 15:17 |
|
CarForumPoster posted:The venv folder that pycharm creates is 75MB with only pip20 and setuptools...is that normal? I can't speak for anything else, but I just looked at the folder sizes for two virtual environments I have - one has pretty much nothing and is 9.4MB, and one is an environment I use to teach Django and that's 48MB (Django is about 33MB of that). To test something out, I started a new PyCharm project and had it make a venv, and it's 14.9MB. It looks like PyCharm's process for creating a venv is shoving some extra stuff in there. pip list shows not much installed, but some of the folders inside the venv have some extra stuff. I'm not familiar enough with how venvs are made to explain why PyCharm's venv creation inflates the size of the environment, but there's a difference for sure. It shouldn't be 75MB, though. Unless it's a difference of OS?
|
# ? Aug 26, 2020 22:06 |
|
Appreciate the replies. I'll def try creating my venvs outside of PyCharm next time. I ended up giving up, moved to a Linux development environment, and got it deployed.
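For what it's worth, you can create a venv straight from the stdlib's venv module instead of through PyCharm or the shell; a minimal sketch (the demo-venv path is just a placeholder):

```python
import venv
from pathlib import Path

# Create a bare venv without pip; bootstrapping pip/setuptools
# (with_pip=True) is where most of a fresh venv's size comes from.
target = Path("demo-venv")
venv.create(target, with_pip=False, clear=True)

# Every venv carries this marker file at its root.
print((target / "pyvenv.cfg").exists())
```

`python -m venv demo-venv` on the command line does the same thing, with pip included by default.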
|
# ? Aug 26, 2020 23:54 |
|
Can you talk about your Zappa experience? I’m curious about using it for a Django problem but last I looked the maintainer had kind of noped out from appearances.
|
# ? Aug 27, 2020 00:41 |
|
Hed posted:Can you talk about your Zappa experience? I’m curious about using it for a Django problem but last I looked the maintainer had kind of noped out from appearances. I'm still an AWS novice. I prefer not to spend my time server-fuckling, which is why I was drawn to Lambda/Zappa. It kinda seems less actively supported to me as well, but if you're doing the things it is designed to do, such as putting a Django site on Lambda, it can be pretty great. Having never made a serverless app before (our other Django site is on EBS), I fired up a CloudFormation template and had a serverless Django site with a corny Bootstrap template, RDS DB, django CMS and REST API on an actual domain with an ACM SSL cert in no time (maybe 1-2 days?). It went fast. It runs locally with no issues. I'd say the learning curve was maybe slightly less than with deploying to EBS via the AWS CLI. When you're not doing what it does well, or when it just decides to gently caress you, it's hard to diagnose what's up. It's not documented very well and the error handling isn't great. I've wasted easily 10+ hours on issues where the error turned out to have absolutely positively nothing to do with the root cause. That GitHub readme is it as far as I can tell. It isn't ubiquitous enough that every error is well documented, and the errors usually come in the form of a CloudWatch log that says "NoneType is not subscript-able". I may have had less hard of a time if I properly set up AM2 from the start, but I mostly develop on Win10 (yea I know)
|
# ? Aug 27, 2020 01:00 |
|
Feel like some might consider this golf-y compared to a full try/except, opinions? Python code:
|
# ? Aug 27, 2020 01:55 |
|
KICK BAMA KICK posted:Feel like some might consider this golf-y compared to a full try/except, opinions? From both an (admittedly probably meaningless) performance and a (much more important) readability standpoint, I would only do this if it was extremely likely that, in the normal course of operation, that exception would happen. Like at least a 50/50, and maybe more often than not. It adds so much emphasis to saying yup, this might happen, yup, it for sure might, compared to a try/except (and towards perf, a try/except will typically be implemented to have little effect in the no-exception-raised path, since that's the more common case). Everything about try/except style exception handling is designed for exceptions as, well, exceptions - things that can happen that aren't intended but can still be accounted for on an "oh that's not what I wanted, I guess I'll try this other thing instead" basis. That's why the performance is designed to have little or no impact on the no-exception pass, and that's why the weight of the syntax is in the exception handler. Switching that up so that the work is done in either case doesn't make me think "oh this code is particularly conscientious", it makes me think "this exception must happen constantly, I wonder if somebody's using it for flow control".
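The snippet under discussion didn't survive the quote, but assuming it was the usual contextlib.suppress golf, the two shapes side by side look something like this (the dict and key are invented):

```python
from contextlib import suppress

d = {"a": 1}

# Full try/except: the no-exception path is cheap; the handler
# carries the visual and semantic weight.
try:
    value = d["b"]
except KeyError:
    value = None

# The golfed version: reads as "this may well fail, and that's fine".
value = None
with suppress(KeyError):
    value = d["b"]

print(value)  # None either way
```

Both land in the same place; the difference is purely what the syntax signals to the reader about how likely the failure is.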
|
# ? Aug 27, 2020 02:52 |
|
Ihmemies posted:So uh in my youth I played World of Warcraft and did not study in university, altough I was enrolled there for quite many years. Later I studied to be a x-ray technician. I graduated in 2016 and now I am very loving bored indeed. I have tried to get work closer to home but no my commute is 95 kilometers 1 way (a long trip) still, after four. There simply are not jobs closer to home (at least for me). Data Engineering is a broad field, but SQL is a good start. In terms of Python, here is what I look for when hiring people into junior roles that don't necessarily require a lot of experience, but where I look for potential and interest more than anything:
- Interact with a database from Python (SQLAlchemy Core or ORM help here)
- Get data from APIs (requests, httpx etc. help, as does knowledge of HTTP, REST etc.)
- Parse/write JSON
- Parse/write XML
- Parse/write CSV
- Loops and list/dict comprehensions (i.e. incremental changes over sequences of uniform data)
- Iterators/lazy evaluation
- Threading/multiprocessing/queues
- Dataframes (mainly stuff like pandas)
- I/O and file system interaction
- Regex
If you want to combine these things in a little project, maybe have a look if you can find an API or two which provide data about something you are interested in. Build a process that reads, parses, cleans, and transforms that data, maybe enriches it with some additional data, then write it all into a database (look at normalisation and table design as well). Do some calculations on top of the data, or read data from the database again for further enriching. Sports, gaming, stock markets, public transport etc. all work well for this kind of thing. I think of data as potentially infinite sequences of key/value pairs (or sequences thereof), which means I ultimately want to operate on each entry/row individually, applying something to the row, and basically compose these transformations together to get a pipeline.
Essentially, have a look at the idea of map/reduce/filter. Basically, learn functional programming in a proper language. I am half kidding, but the concepts help a ton, though Python makes sense for the domain because data scientists etc. use it a fair bit as well. Hollow Talk fucked around with this message at 08:30 on Aug 27, 2020 |
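To make the pipeline idea concrete, here's a minimal sketch with a hard-coded JSON payload standing in for an API response (station names and fields are invented; a real project would fetch with requests/httpx or urllib), composing filter/map steps and landing the rows in SQLite:

```python
import json
import sqlite3

# Stand-in for an API response body.
raw = json.dumps([
    {"station": "Tampere", "temp_c": "12.5"},
    {"station": "Helsinki", "temp_c": None},
    {"station": "Oulu", "temp_c": "7.0"},
])

rows = json.loads(raw)

# Compose per-row transformations: drop incomplete rows, then
# convert strings to floats -- each step applies to one row at a time.
clean = filter(lambda r: r["temp_c"] is not None, rows)
typed = map(lambda r: (r["station"], float(r["temp_c"])), clean)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, temp_c REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", typed)

# Read it back for further processing/enrichment.
print(conn.execute("SELECT COUNT(*), AVG(temp_c) FROM readings").fetchone())
```

Swap the hard-coded payload for a real API call and `:memory:` for a file and you have the skeleton of the project described above.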
# ? Aug 27, 2020 07:31 |
|
Thanks for the good answers! I could make a demo project along the lines of "code with repl.it python projects for beginners". I could use the Finnish Meteorological Institute's open data API to get some historical weather data, for example, parse & store it, and allow downloading customizable parts of the data as CSV/XML/something. And make some tables/charts from the data or otherwise process it.
|
# ? Aug 27, 2020 09:27 |
|
Eval is not very good. But it works here, and this accepts no user input. So... could I use something else instead? Any ideas? I'm calling numpy functions with the dictionary key names...
|
# ? Aug 28, 2020 09:34 |
|
Couldn't you just use the functions themselves as dictionary keys? I'm lying in bed and way too lazy to open repl.it on my phone, but I'm pretty sure Python dictionary keys can be any kind of hashable object, including a function - you just wouldn't call it in the dictionary. e: since you're returning the dictionary, that sounds non-ideal - you could create a list of function/string tuples and iterate over that; call the first part of the tuple and update the dictionary using the second part as the key death cob for cutie fucked around with this message at 10:07 on Aug 28, 2020 |
# ? Aug 28, 2020 10:02 |
|
Epsilon Plus posted:Couldn't you just use the functions themselves as dictionary keys? I'm laying in bed and way too lazy to open repl.it on my phone, but I'm pretty sure Python dictionary keys can be any kind of object, including a function - you just wouldn't call it in the dictionary. Yes, this seems to be possible: quote:e: since you're returning the dictionary that sounds non-ideal - you could create a list of function/string tuples and iterate over that; call the first part of the tuple and update the dictionary using the key of the second part And this possibly too... I will try this now. Thanks for the help. At least in this case, even though the original solution "works", it was so bad there was good reason to try to make it better. Edit: I guess you meant something like this. Ihmemies fucked around with this message at 10:42 on Aug 28, 2020 |
# ? Aug 28, 2020 10:09 |
|
To be more specific, I was thinking something like this: code:
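The code block above was lost in the archive, but based on the description, the (function, name) tuple idea looks roughly like this -- with the statistics module standing in for the numpy calls in the original repl, and the data/key names invented:

```python
import statistics

data = [2, 6, 2, 8, 4, 0, 1, 5, 7, 3]

# (function, key-name) pairs: call the function on the data,
# store the result under the string key. No eval needed.
operations = [
    (statistics.mean, "mean"),
    (statistics.pvariance, "variance"),
    (statistics.pstdev, "standard deviation"),
    (max, "max"),
    (min, "min"),
    (sum, "sum"),
]

result = {}
for func, name in operations:
    result[name] = func(data)

print(result["mean"], result["max"])
```

The functions are just values here, so swapping in np.mean etc. for a numpy array works the same way.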
|
# ? Aug 28, 2020 22:37 |
|
Ihmemies posted:How to learn python? What would be important for a Data Engineer, where to concentrate? I need to be able to produce decent quality code to impress some dudes so they would hire me. https://towardsdatascience.com/pandas-for-newbies-an-introduction-part-i-8246f14efcca SQL is going to be super important, but Pandas is how data science people do the data science. The more you understand how people are going to actually *use* the data, the more you can understand the infrastructure required to support that. Plus Pandas dataframes are loving magic. Just discovered this thread. Love Python. I work for a primarily PHP-oriented shop, where all of our ETL stuff was shoehorned into PHP for ages (in the interest of homogeneous projects, I guess). The past few years I've been trying to migrate that side of our infrastructure to Python and it's been so much fun.
|
# ? Aug 29, 2020 01:56 |
|
Epsilon Plus posted:To be more specific I was thinking something like this: I finally realized that too... So I did it like that! 26 lines of code took many hours. I guess it was a good exercise ( https://repl.it/@Ihmemies/fcc-mean-var-std#mean_var_std.py ) The next one is using Pandas with a CSV of 35,000 census entries. I don't currently like pandas, but maybe I'll learn to like it more after a while. Ihmemies fucked around with this message at 07:02 on Aug 29, 2020 |
# ? Aug 29, 2020 06:56 |
|
Ihmemies posted:The next one is using Pandas with a csv of 35000 census entries. I don't currently like pandas but maybe I'll learn to like them more after a while. Pandas is the bee's knees for getting your data in order or doing some data/feature engineering before passing it to the DL/ML/NLP library of your choice
|
# ? Aug 29, 2020 07:29 |
|
How can I fire off a subprocess, move on with execution, and have a callback or do something small post-execution? I have a bunch of finite state machines that I go through and update every second in the main program loop. Sometimes the FSMs, which are python objects, run a function that does a subprocess call out to a system binary that isn't computationally expensive but can easily take several seconds to complete waiting on a response. This sucks and blocks things up. Right now that blocking call is in a subprocess.run(). I've looked at subprocess.Popen() but it's kind of a fire and forget... I guess I could Popen, then track state and see when that is done and if it had the desired effect, but I feel like I'm missing something more obvious. Something that allowed me to fire off a subprocess and do a callback or change a flag later so I can keep the main loop going would be awesome. I've looked at asyncio and Popen like I mentioned but I feel like I'm missing something obvious in the std lib. Bruegels Fuckbooks posted:I'd just use mmap for this - memory mapped files generally work well for this use case. I forgot to respond to this but thank you--this worked splendidly for what was essentially sending over a sequence of integers.
|
# ? Aug 29, 2020 21:34 |
|
ThreadPoolExecutor from concurrent.futures (which is in vanilla Python) will let you attach a callback function to the Future object that gets returned when you submit a task; this is the easiest way to perform a "fire and forget" operation that doesn't block your main thread. This sounds like exactly what you want
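A rough sketch of that pattern, with a trivial python -c subprocess standing in for the slow system binary from the question:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def slow_call():
    # Stand-in for the system binary that can take several seconds.
    return subprocess.run(
        [sys.executable, "-c", "print('done')"],
        capture_output=True, text=True,
    )

def on_finished(future):
    # Runs once the subprocess returns, without blocking the main loop.
    print("callback saw:", future.result().stdout.strip())

executor = ThreadPoolExecutor(max_workers=2)
future = executor.submit(slow_call)
future.add_done_callback(on_finished)

# ...the main loop keeps ticking the FSMs here instead of blocking...

executor.shutdown(wait=True)  # for the demo; a real loop would keep running
```

Alternatively the callback could just set a flag on the FSM object, which the main loop then picks up on its next tick.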
|
# ? Aug 29, 2020 22:16 |
|
CarForumPoster posted:Pandas is the bees knees for getting your data in order or doing some data/feature engineering before passing to the DL/ML/NLP library of your choice Just use it for structured data of any kind. I deal with a ton of CSVs, and filtering, sorting etc. is all dead easy with Pandas. The worst part might be having to define the dtypes for a CSV with many columns (and that's really not that bad).
|
# ? Aug 31, 2020 14:29 |
|
you prefer it over the 'csv' module? i've been recently learning how to deal with csv files in python and most things i found there deal with that module. i'm simply reading from existing csvs and writing to new ones.
|
# ? Aug 31, 2020 15:37 |
|
abelwingnut posted:you prefer it over the 'csv' module? i've been recently learning how to deal with csv files in python and most things i found there deal with that module. i'm simply writing from existing csv and writing to new ones. Use the right tool for the job. Basic CSV processing doesn't require fancy Pandas wrapping. Pandas dataframes are more for major data manipulation - subsetting, reorganizing, sorting, filtering, summary stats, etc. That said, if your task is "read a CSV, drop a column, and write it out," pandas might save you some headache.
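For the simple read-filter-write case, a stdlib-only sketch (in-memory StringIO stands in for real files, and the column names are invented):

```python
import csv
import io

# Stand-ins for open("in.csv") and open("out.csv", "w", newline="").
src = io.StringIO("name,qty,price\nwidget,2,9.99\ngadget,0,4.50\n")
dst = io.StringIO()

reader = csv.DictReader(src)
writer = csv.DictWriter(dst, fieldnames=["name", "price"])
writer.writeheader()

for row in reader:
    if int(row["qty"]) > 0:  # filter rows we don't want
        # drop the qty column on the way out
        writer.writerow({"name": row["name"], "price": row["price"]})

print(dst.getvalue())
```

Once the job grows past "filter and drop a column" into joins, group-bys, and summary stats, that's where pandas starts to earn its keep.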
|
# ? Aug 31, 2020 16:33 |
|
xarray is pretty sick for multidimensional arrays where a 2d table just isn't enough
|
# ? Aug 31, 2020 19:17 |
|
abelwingnut posted:you prefer it over the 'csv' module? i've been recently learning how to deal with csv files in python and most things i found there deal with that module. i'm simply writing from existing csv and writing to new ones. I use pandas for any data that is table-like. pd.read_csv() has some very useful params for dealing with CSV issues like multi-GB files, corruption, and non-comma-delimited database dumps with line errors.
|
# ? Aug 31, 2020 19:54 |
|
Hey y'all - Looks like I came here at just the right time, because I, too, am dealing with..."issues" with a CSV file. So, the long and short of it is that I'm trying to automate my job on the sly, because gently caress breaking my back so Jeff CEO can buy another loving yacht. One of our vendors offers a Quote Report that we use to quote our customers, in a variety of formats (all of which are Extremely hosed Up And Stupid, as I'll demonstrate shortly). So I export this CSV version of the report and am expecting something simple, like: code:
code:
|
# ? Sep 2, 2020 21:00 |
|
I think something like re.sub() handles this, but I haven't used it before so I can't provide an offhand example for you
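Assuming the extra commas are always empty padding and never legitimate empty fields (a big assumption for a quoting report), a re.sub() pass could look like this; the sample line is invented, not from the real export:

```python
import re

line = 'Quote Number,,,,,12345,,,"Quote Description",,,,ACME Corp,,,'

# Collapse runs of commas to one and drop trailing ones. Only safe if
# the extra commas are padding -- a real empty field would be eaten too.
cleaned = re.sub(r",{2,}", ",", line).rstrip(",")
print(cleaned)
```

Note this also breaks if any quoted field legitimately contains commas, which is exactly the fragility people warn about below.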
|
# ? Sep 2, 2020 21:15 |
Marx Was A Lib posted:Hey y'all - Are they actually that bad, i.e. most fields missing, some fields aligned with their headers in rows, some misaligned? If you've just trimmed some stuff and "Quote Description" actually should be the description instead of the customer name in your example, then you might be able to go line by line through the records and do something like this: code:
a foolish pianist fucked around with this message at 23:15 on Sep 2, 2020 |
|
# ? Sep 2, 2020 22:08 |
|
imo any ad hoc bandaid for hosed up data like that is going to be an accelerating spiral of immiserating maintenance work--just say no!
|
# ? Sep 2, 2020 22:28 |
|
This kind of data doesn't look like too much effort to work with. A lot of "data analysis" work is just figuring out the most effective way of ingesting the data. Sometimes that means dealing with spreadsheets that were inconsistently managed by a salesman with advancing dementia; that's part of the job
|
# ? Sep 2, 2020 22:59 |
|
Data in a dumb format isn't a big deal to ingest as long as you know what the dumb format is. Data in an inconsistent format is a bigger pain in the rear end.
|
# ? Sep 2, 2020 23:07 |
|
Marx Was A Lib posted:Hey y'all - 1. Install Jupyter Notebook locally. Look up some basic tutorials on using it with Python, but it's not anything other than a visual interactive Python shell with some GUI niceties thrown in. (https://jupyter.org/) 2. Get Pandas and its dependencies installed in your Python environment. Also plenty of tutorials out there for that, but let me know if you want anything specific. 3. Fire up your Jupyter Notebook environment and ingest your data source into a Pandas dataframe, i.e.: code:
code:
code:
code:
|
# ? Sep 3, 2020 15:27 |
|
Thanks for the suggestions, all! After reading, I've noticed a few things about this specific CSV file: it's always either 4 or 23 extra commas, and it seems as though this csv file is used to somehow create the (even more hosed up) Excel version of the quote. Our vendor must have some kind of proprietary software that reads the csv file and executes some kind of M Code in Excel to create the "pretty" (and I use that term EXTREMELY loosely) version of the quote that I guess would be customer-facing if we were reselling a direct contract? As per your suggestion, shoeberto, I've got a nice little miniconda setup with a local jupyter notebook so I think I'll start with pandas. Quite honestly, the only thing I really care about in the entirety of the CSV file is about halfway down into the file: code:
I sanitized the file for any information that could be used to identify me or the companies involved (I definitely don't work for or adjacent to CISCO, I can say that with 100% accuracy - and to be clear, any additional spaces or linebreaks out of place were in the original file; I've done nothing but change names and addresses to protect the innocent). Honestly, it's quite a wonder to behold, and it almost makes sense if you look at it in the formatted context where it's meant to be more easily human-readable. Appreciate the input! code:
Lib and let die fucked around with this message at 16:03 on Sep 3, 2020 |
# ? Sep 3, 2020 16:00 |
|
Phone posting so can’t post sample code, but if the only part of the file you want is: code:
|
# ? Sep 3, 2020 18:18 |
|
I see another pattern too: a lot of it seems to be *key name*,,,,,*some kind of value here*. It would probably be awful, but it looks like you could regex your way out of this.
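A sketch of that *key*,,,,*value* idea with re.match(); the field names and values here are made up, not taken from the real report:

```python
import re

lines = [
    "Quote Number,,,,,Q-12345",
    "Customer,,,,ACME Corp",
    "Total,,,,,,1999.99",
]

# Pull key/value pairs out of the comma-padded rows: everything up to
# the first comma is the key, everything after the comma run is the value.
pairs = {}
for line in lines:
    m = re.match(r"^([^,]+),+(.+)$", line)
    if m:
        pairs[m.group(1)] = m.group(2).rstrip(",")

print(pairs["Quote Number"])
```

As with any regex bandaid for data like this, it holds only as long as the padding pattern stays consistent.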
|
# ? Sep 3, 2020 18:22 |
|
Ok bud, not gonna lie, I might have steered you wrong the first time around by suggesting Pandas. That's really for properly-structured tabular data. In this case, if you can get a hold of the Excel file, you *may* be better off using openpyxl: https://openpyxl.readthedocs.io/en/stable/ Assuming they are trying to use Excel to basically handle styling something that's not really tabular data, you are probably going to want to just access individual cells and extract the data into a normalized format. Like you can do what others are suggesting and regex the hell out of it, but that's probably going to be somewhat fragile. If for some reason the Excel file actually merges all of that crap into meaningful cells, it's going to be easier to do like: code:
If, from there, you want to transform it into a normalized format, it becomes relatively straightforward, and you can use whatever you want. For example, you could make it into an array of dicts: code:
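Since the snippet above didn't make it into the archive, here's a rough pure-Python version of the array-of-dicts idea, with plain lists standing in for the cell values that openpyxl's ws.cell(row=..., column=...).value calls would return (part names and prices are invented):

```python
# Each inner list stands in for one extracted worksheet row.
raw_rows = [
    ["CISCO PART 1", "2", "100.00"],
    ["CISCO PART 2", "1", "250.00"],
]

# Normalize into a list of dicts with proper types, computing the
# extended price along the way.
line_items = [
    {
        "part": part,
        "qty": int(qty),
        "unit_price": float(price),
        "ext_price": int(qty) * float(price),
    }
    for part, qty, price in raw_rows
]

print(line_items[0]["ext_price"])
```

From a structure like that, writing out a clean CSV or loading a database table is trivial.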
|
# ? Sep 3, 2020 18:31 |
|
Yeah, that's a ~25-column-wide Excel sheet in CSV format. Messing with it in terms of Excel is probably the way to go.
|
# ? Sep 3, 2020 18:42 |
|
Dr Subterfuge posted:Yeah that's a ~25 col wide excel sheet in csv format. Messing with it in terms of excel is probably the way to go. Funny story: I had actually considered using the Excel format, and even considered maybe doing some fancy PowerQuery stuff, but ended up going with the CSV file thinking I could somehow manipulate it to be more normal. The thing with the spreadsheet (and why I quoted the good Dr specifically) is that it's also Extremely hosed Up and Stupid: But, with what you've all offered here, I think I may end up using the spreadsheet after all, since it will be handy to be able to reference the cells specifically. There are, as you mentioned, ~25 functional columns here, so I think I can work with this. Since the visual layout won't matter, I'll likely just un-merge all the columns and access their values directly. Thanks again! e: I've also got the ability to export as an XML file, but it's a SOAP XML file, which based on what I've looked into so far requires some kind of additional URL that's some sort of decryption or authorization key? I'm not really sure how to handle that type of XML, and it doesn't seem to behave nicely if I'm trying to read it as a normal XML file. Lib and let die fucked around with this message at 19:08 on Sep 3, 2020 |
# ? Sep 3, 2020 19:02 |
|
Hahaha what an abomination of a sheet. At least the values are always stored in the first cell of a group of merged cells.
|
# ? Sep 3, 2020 19:19 |
|
|
The fact that a spreadsheet like that is possible is both fascinating and frightening.
|
# ? Sep 3, 2020 19:28 |