|
This is a rookie DE/ETL question but I'm trying to wrap my head around the AWS tools I should be using... As an exercise, I want to first download twelve gzipped CSV files from this LEGO database. Then I want to move that data into a new AWS RDS for MySQL database using the given schema, the end result being that I can write queries against that DB. https://rebrickable.com/downloads/ What is a "modern AWS" way to do this? Azure Data Factory has "low code" pipelines that make it relatively simple, but I'm not sure how to go about it with AWS. *Here is the Azure Data Factory project that I'm trying to reproduce using AWS tools: https://www.cathrinewilhelmsen.net/series/beginners-guide-azure-data-factory/page/2/
|
# ? Feb 2, 2022 04:46 |
|
|
|
Put the CSVs in S3, use a Glue crawler to read the CSVs and output them to RDS, I think. Alternatively, use Athena to query the tabular data in S3 directly.
|
# ? Feb 2, 2022 06:26 |
|
Happiness Commando posted:Put the CSVs in S3, use a glue crawler to read the CSVs and output it to RDS, I think. Thanks for the ideas. I'm guessing I can use AWS Lambda and write a Python function that could get/copy the CSV files from the website and then place them onto S3, then roll from there.
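A rough sketch of that Lambda idea (the bucket name and the table list are my assumptions, not an official Rebrickable manifest; the CDN base URL is assumed too):

```python
import urllib.request

# The twelve gzipped CSV dumps in the Rebrickable schema (assumed names).
TABLES = [
    "themes", "colors", "part_categories", "parts", "part_relationships",
    "elements", "sets", "minifigs", "inventories", "inventory_parts",
    "inventory_sets", "inventory_minifigs",
]

def download_url(table, base="https://cdn.rebrickable.com/media/downloads/"):
    """Build the download URL for one table's gzipped CSV."""
    return f"{base}{table}.csv.gz"

def handler(event, context):
    """Lambda entry point: stream each dump straight into S3, no local disk."""
    import boto3  # provided by the Lambda runtime
    s3 = boto3.client("s3")
    for table in TABLES:
        with urllib.request.urlopen(download_url(table)) as resp:
            s3.upload_fileobj(resp, "my-lego-raw-bucket", f"raw/{table}.csv.gz")
    return {"uploaded": len(TABLES)}
```

From there a Glue crawler (or a Glue job) pointed at `raw/` can take over.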
|
# ? Feb 2, 2022 16:25 |
|
Create a Lambda function to pull the files into S3. Then either:

- Point Athena at the bucket, or
- Use Data Pipeline (or your own ETL script on a t3.micro instance) to load the CSVs from S3 into RDS.

And: a Lambda function to turn the EC2 instance on/off when it's not processing the CSVs.
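The on/off Lambda in the last step could look something like this (event shape and instance IDs are assumptions; wire whatever schedules it to pass them in):

```python
def parse_action(event):
    """Pull the desired action and instance IDs out of the invoking event.
    The event shape here is an assumption, not a fixed AWS format."""
    action = event.get("action", "stop")
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    return action, event.get("instance_ids", [])

def handler(event, context):
    import boto3  # lazy import so the module loads without AWS credentials
    action, ids = parse_action(event)
    ec2 = boto3.client("ec2")
    if action == "start":
        ec2.start_instances(InstanceIds=ids)
    else:
        ec2.stop_instances(InstanceIds=ids)
    return {"action": action, "instance_ids": ids}
```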
|
# ? Feb 2, 2022 17:08 |
|
Oh man speaking of Glue/Athena/etc. How cool is that new CloudTrail DataLake service! For my poo poo show of an org that will be a huge benefit. If I could only convince them to pay for it now.... EDIT: And while I'm at it. All the EKS alerts for GuardDuty are huge! Seriously nice work by that team and I hope more are in the pipeline.
|
# ? Feb 2, 2022 17:17 |
|
Agrikk posted:Create a lambda function to pull the files into S3 Exactly what I was looking for, thanks! BaseballPCHiker posted:Oh man speaking of Glue/Athena/etc. The amount of services/tools available is staggering to me. Just looking at Data Engineering, there is a shitload of services/tools to wrap your head around. I can imagine Security is even more so. We live in cool times.
|
# ? Feb 2, 2022 19:39 |
|
My new employer has pockets and is willing to pay for good training. Are there any well-respected AWS training courses/companies that I should check out? They'd need to be data-focused courses. I'm thinking something like what SANS does for cyber training.
|
# ? Feb 3, 2022 02:42 |
|
Hughmoris posted:My new employer has pockets and is willing to pay for good training. Is there any well-respected AWS training courses/companies that I should check out? They'd need to be data focused courses. These are the AWS courses for data; you just need to find a training provider (they're linked from the specific courses, I think). All the providers follow the same course content if you go through this: https://www.aws.training/LearningLibrary?query=&filters=Domain%3A107%20Language%3A1&from=0&size=15&sort=_score I've used Bespoke Training in Australia, who were good.
|
# ? Feb 3, 2022 03:42 |
|
Agrikk posted:Create a lambda function to pull the files into S3 If you can run it from ECS, they have tasks that just run and exit. I am using it like that to do the reverse: dump a database to an S3 location to create a file that is always the latest data so a 3rd party can analyze it. It just took creating a Dockerfile and setting up the task.
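For the "run and exit" pattern, kicking off such a task from code is a one-shot `run_task` call. A hedged sketch (cluster, task definition, subnet, and security group names are all placeholders):

```python
def run_task_args(cluster, task_definition, subnets, security_groups):
    """Build the kwargs for ecs.run_task() for a run-and-exit Fargate task."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",
            }
        },
    }

if __name__ == "__main__":
    import boto3  # only needed when actually launching the task
    ecs = boto3.client("ecs")
    ecs.run_task(**run_task_args(
        "etl-cluster", "db-dump-to-s3:1", ["subnet-0abc"], ["sg-0abc"]))
```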
|
# ? Feb 3, 2022 16:56 |
|
JHVH-1 posted:ECS I always forget about containers. It's like I grew up in a servers-as-pets world and then skipped straight to serverless-as-Lambda.

For those looking for free and interesting datasets to build workloads from, here's a list of the data sources I use:

- United States Geological Survey earthquake catalog - a running list of all the reported and detected earthquakes worldwide: https://earthquake.usgs.gov/fdsnws/event/1/
- Visual Crossing weather data - there's a free tier for pulling weather data from locations around the world: https://www.visualcrossing.com/weather-api (also Weather Underground: http://api.wunderground.com/api)
- Geyser eruption times for all the geysers in Yellowstone National Park. This one is exceptionally fun to use with machine learning: try to create an AI/ML model that will predict eruption times! https://www.geysertimes.org/api/v5/docs/index.php
- LEGO database: https://www.kaggle.com/rtatman/lego-database
- Eve Online market database - pull every market order in Eve Online in near real time (https://esi.evetech.net/ui/) and couple that with the Eve Online static data export to build nifty web apps: https://developers.eveonline.com/resource
- Folding@home - forgot my favorite for Big Data stuff: https://apps.foldingathome.org/daily_user_summary.txt.bz2 and https://apps.foldingathome.org/daily_team_summary.txt.bz2

The F@H user data is a fun one because there's something like two million rows in a single file. Pull it every time it refreshes (every 90 minutes or so) and you can have a billion records in a table after about four months. (My Users data table has 5.7 billion rows and is about 900 GB in size, which is really helpful for practicing managing and manipulating data at scale.)

Agrikk fucked around with this message at 02:57 on Feb 4, 2022 |
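A stdlib-only sketch for grabbing that F@H dump and sizing it up before loading it anywhere (the URL is from the list above; the record format is not assumed, it just counts lines):

```python
import bz2
import urllib.request

def count_rows(raw: bytes) -> int:
    """Count newline-delimited records in a bz2-compressed dump."""
    return len(bz2.decompress(raw).splitlines())

if __name__ == "__main__":
    url = "https://apps.foldingathome.org/daily_user_summary.txt.bz2"
    with urllib.request.urlopen(url) as resp:
        print(count_rows(resp.read()))
```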
# ? Feb 4, 2022 01:36 |
|
Whoa that is good stuff, thanks!
|
# ? Feb 4, 2022 02:00 |
|
The TfL data sets can be interesting to work with as well: https://tfl.gov.uk/info-for/open-data-users/our-open-data?intcmp=3671
|
# ? Feb 4, 2022 12:27 |
|
Goal: Have an autoscaling group that launches 4 instances with ENIs as the primary and only network interface.

Reason: We have purchased carrier IP space and do not want to use 2 IPs for every instance when we only need one.

Having a hard time coming up with a way to do this. I can create an autoscaling group, launch instances with a dynamic IP on eth0, and then use user-data to attach an ENI as eth1. But that uses up two of our IPs per instance.

I can create a launch template and define an ENI in it. I can then launch a single instance with the launch template and it comes up with the ENI as the only network interface. I don't see a way to do this with autoscaling, though. I'd be fine creating 4 autoscaling groups with a min/max of 1 instance, but when I try to create an autoscaling group with a launch template that has an ENI definition in it, it fails with "Incompatible launch template: Network interface ID cannot be specified as console support to use an existing network interface with Auto Scaling is not available."

Essentially, what I need is a way to have a pool of 4 ENIs and tell autoscaling to use that pool when launching an instance. Does such a thing exist in AWS currently?

Scrapez fucked around with this message at 20:14 on Feb 7, 2022 |
# ? Feb 7, 2022 20:10 |
|
The autoscaling group resource has a field for specifying a launch template, the same kind that you would use for an EC2 instance. This is distinct from and mutually exclusive with the usual "ASG Launch Configuration" config item. I don't have great access to the ASG web interface at the moment but this setting should be hiding in there somewhere.
|
# ? Feb 7, 2022 20:38 |
I am completely new to this but currently studying for the SOA-C02 so forgive me if I'm far off base here but I wanna take a stab at it. In this case, would attaching the ASG to a VPC and a gateway work?
|
|
# ? Feb 7, 2022 20:58 |
|
I'm dumping files to S3 and on a schedule need to take all the new ones, convert them to custom Avro, and make Avro files for every N files. I have a Python function to bottle up and convert; is there a more elegant way to do this than s3 sync to a computer with a real file system and pushing it back? I've used Kinesis Firehose on ingest but don't see anything that could accomplish what I want.
|
# ? Feb 18, 2022 04:14 |
|
Hed posted:I'm dumping files to S3 and on a schedule need to take all the new ones and convert them to custom avro and make avro files for every N files. Could schedule an AWS Lambda function (e.g. cron or rate expression) to do it if the 15min timeout isn't an issue in your application. It sounds like you don't want to trigger on each new file in the S3 bucket, but if you did, AWS lets you trigger a Lambda by adding a file to S3 pretty easily.
|
# ? Feb 18, 2022 04:23 |
|
CarForumPoster posted:Could schedule an AWS lambda function (e.g. cron or rate) to do it if the 15min timeout isn't an issue in your application. Yeah, Lambda on a schedule would be the way to go. Just set up a rule in EventBridge to call the Lambda. Edit: It will be easier to do this file by file if you trigger the Lambda for each created file in the S3 bucket, as that way you get passed the details of the item (bucket ARN and object key) when the Lambda is triggered. If you run it on a schedule you'll have to call the S3 API to list the objects and iterate through them. But if the Avro grouping is a requirement, that isn't an option. ledge fucked around with this message at 05:08 on Feb 18, 2022 |
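The list-and-group half of the scheduled version might look like this (bucket and prefix are placeholders; the batching helper is the "every N files" part):

```python
def batches(keys, n):
    """Group object keys into fixed-size batches; the last batch may be short."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return [keys[i:i + n] for i in range(0, len(keys), n)]

def list_new_keys(bucket, prefix="incoming/"):
    """List candidate objects for the scheduled run, paginating past 1000 keys."""
    import boto3  # lazy import: keeps the module importable without AWS creds
    s3 = boto3.client("s3")
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

Each batch then becomes one Avro file; you'd still need something (a manifest object, or moving processed keys to another prefix) to remember which files are "new".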
# ? Feb 18, 2022 05:00 |
|
Does anyone have any experience, or heard of experiences, for working at an AWS DoD gig? ClearanceJobs has a ton of openings for AWS gigs that look interesting.
|
# ? Feb 25, 2022 21:40 |
|
I have a React app I am looking to host on AWS, with a few constraints: I can't use S3 website hosting because the bucket must be completely private, and access to the app is only via intranet or company VPN, so basically all public-facing solutions are out. I was exploring CloudFront serving the private S3 files and putting a WAF on top limiting IP ranges. Is there a better, more sustainable solution? Ideally I'd like to template this with Terraform so I can spin up the same stack for the next series of web apps.
|
# ? Mar 8, 2022 04:18 |
|
You don't need a WAF to limit ingress IP - you can do that with regular security groups. It's a fine thing to add if you want, though
|
# ? Mar 8, 2022 05:05 |
|
Happiness Commando posted:You don't need a WAF to limit ingress IP - you can do that with regular security groups. It's a fine thing to add if you want, though If he's going via CloudFront he will because you can't attach security groups to a CloudFront distribution. lazerwolf posted:I was exploring cloud front serving the private S3 files and putting a WAF on top limiting IP ranges. We use basically this setup on our end, all in Terraform as well. But because we all work remote and none of us have static IPs it requires us to route requests for CloudFront's public IP addresses through our VPN. So, if that's something you have to worry about you might get a lot more traffic on your VPN than you bargained for.
|
# ? Mar 8, 2022 05:16 |
|
lazerwolf posted:I have a React app I am looking to host on AWS. I have a few constraints: S3 proxy integration with a private API gateway (requires VPC endpoint, and I presume you can already tunnel in to your VPC) could achieve what you want here if you want to be fully serverless. Otherwise you could just use nginx on an EC2 instance or if you need some redundancy you could do a service on Fargate.
|
# ? Mar 8, 2022 07:27 |
|
Maybe an S3 Access Point? I haven’t tried it so there could be quirks, but looks feasible. Edit: Maybe don’t even need the access point. I had an S3 bucket with just a resource policy allowing based on “aws:sourceVpc”. Might work if the VPCs have S3 gateway endpoints, and you just need to allow by VPC and not some general IP thing. crazypenguin fucked around with this message at 23:59 on Mar 8, 2022 |
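A hedged sketch of that bucket policy (bucket name and VPC ID are placeholders; this only works when requests reach S3 through a gateway endpoint in that VPC, so they carry the `aws:SourceVpc` key):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyFromOurVpc",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-intranet-app/*",
      "Condition": {
        "StringEquals": { "aws:SourceVpc": "vpc-0123456789abcdef0" }
      }
    }
  ]
}
```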
# ? Mar 8, 2022 22:53 |
|
After a lot of struggling with permissions, I've gotten it to work, but I have a question regarding prereq #3, "An AWS account with permissions to create the necessary resources." I wanted to grant as little privilege as possible to the user I have associated with this workflow, but there doesn't seem to be any information on what should be granted. I'm sure some of this will depend on what exactly ends up in the CloudFormation template, but I was able to figure some of it out from the error messages along the way and have tightened down policies for:

- Putting objects into just the S3 bucket that contains the template and deployment file
- Granting GET/POST/PATCH on the API Gateway to point it to my Lambda

The errors I was getting out of the SAM CLI weren't always specific, so the easiest way for me to make progress was to apply full access for CloudFormation, Lambda, and IAM on that user, which I know is the wrong thing to do. I'm not sure how to drill down to just the permissions needed to run the deployment and nothing else based on the error messages, so I thought to set up a CloudTrail event log, filter down to the deployment user once I got everything working, and then work backward from those logs to define a policy, which I could then apply to other users that correspond to different repos on the GitHub side. There's a better way to do this... right?

Just found the thing where you can generate a policy from CloudTrail events. I knew it had to exist somewhere.

nullfunction fucked around with this message at 01:09 on Mar 9, 2022 |
# ? Mar 9, 2022 00:52 |
|
I just got asked about Aurora multi-region multi-master, which doesn't exist. Now I get to have a whole bunch of meetings to determine business requirements and figure out what architecture will actually suffice
|
# ? Mar 12, 2022 01:25 |
|
Happiness Commando posted:I just got asked about Aurora multi-region multi-master, which doesn't exist. Now I get to have a whole bunch of meetings to determine business requirements and figure out what architecture will actually suffice You can do master-master replication across regions between Aurora MySQL if you're willing to have an EC2 instance in each region to fix egress and set auto_increment_increment and auto_increment_offset appropriately
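The offset trick mentioned here interleaves auto-increment IDs so inserts on the two writers never collide. A sketch with two writers (these are real MySQL system variables, but session vs. global scope and parameter-group handling on Aurora are left out):

```sql
-- Writer in region A: generates IDs 1, 3, 5, ...
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset = 1;

-- Writer in region B: generates IDs 2, 4, 6, ...
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset = 2;
```

With more writers you raise the increment to the writer count and give each a distinct offset.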
|
# ? Mar 12, 2022 21:43 |
|
I'm working on migrating some services from being manually provisioned via the AWS console to using CDK instead. The application architecture is a web-facing service running on ECS that puts jobs into an SQS queue, and a backend service running on ECS that retrieves jobs from the queue and processes them.

So far, I'm implementing this in 3 tiers of stacks: 1 top-level stack for resources shared company-wide across multiple applications (VPC, an S3 scratch bucket, etc.), 1 application-level "shared" stack that sets up the SQS queues, permissions, and ECR repositories for both aspects of the application code, and finally a stack each for the web API and backend processing ECS deployments.

The stack to deploy ECS requires a task definition that points to the image in ECR, so when the application code changes, we create and tag a new Docker image and push it to ECR. But afterwards, what is the "correct" way to update the running tasks? Should the ECS task definition be updated via running cdk deploy or via the aws update-service CLI command? We had a consultant help set this up initially, but they left us with a deployment stage using both methods, which seems like overkill; plus, deploying via the ECS stack resets the number of desired instances, so I feel like going CLI-only for application version updates is the correct way.

Regardless of which deployment method, I've found that I also need to store the latest version tag in SSM so that if we do update anything in the CDK stack (things like instance type, scaling parameters, etc.), the task definition can find the correct latest version. But I guess my main question is: how close is this setup to "standard", and is it supposed to feel this convoluted?
|
# ? Mar 15, 2022 13:45 |
|
Plank Walker posted:Should the ECS task definition be updated via running cdk deploy Yes, you can do it this way. CDK will build a new image and push it to the bootstrap ECR repository or your specified repository, then publish a new task definition with the new image, and start an ECS deployment of your service. Are you using any of the aws-cdk/ecs-patterns constructs?
|
# ? Mar 15, 2022 22:51 |
|
Edit: I found the answer on the AWS Mappings docs: "You can't include parameters, pseudo parameters, or intrinsic functions in the Mappings section." So, does someone have a suggestion on how I would import the VPC ID for the region I'm executing the security group CloudFormation template in? I don't want to hardcode the value of the VPC into the template because then I'll have to have a different template for each region. I'd like to have only a single network template and security group template that I can execute in multiple regions.

CloudFormation Outputs, Imports, Mappings question: I have a CloudFormation template that builds the network pieces (VPC, subnets, etc.). I've executed this in two different regions and it has an Outputs section that outputs various resource values. For instance, it outputs the VPC ID: code:

I've tried to define a mapping that pulls in the appropriate value of the VPC ID like so: code:

Once it does the sub and imports the value, it should be a single string, so I'm a bit stumped as to why it doesn't like that. Anyone have an idea? Scrapez fucked around with this message at 21:19 on Mar 28, 2022 |
# ? Mar 28, 2022 20:41 |
|
I don't think that you can use any of the cfn intrinsic functions in Mappings, but it's a little hard to say if that is the exact issue here because I'm not super clear on where the Mappings key starts in your second example. Without any other context, in this scenario, I would recommend two things:

1 - If you can avoid prefixing your "production vpc stack" with a per-region name, you can just import it directly. You can just call it production, export it as production-VpcId, and then instead of using Mappings, just Fn::ImportValue production-VpcId. Since the stack must exist in only a single region, the region is implied (and available elsewhere in the API), and you don't need the mapping.

2 - Since you can't change the VpcId of a security group without deleting it, I would embed the security groups in the VPC stack and just use !Ref. In my experience it's a good idea to avoid introducing cross-stack references alongside a "replacement" update behavior, if you can.

Comedy option 3: If you use Ansible for this, the "template_parameters" field is recursively parsed, so you can pass arbitrarily complex maps to CloudFormation with it.

Comedy option 4: If I'm misunderstanding what "Production-OH-Network" means, and you do have this kind of double-dynamic relationship where any given VPC consumer stack needs to consume an output from a stack that you, for some reason, can't know the name of, I would probably use a nested stack instead and then pass the input params through AWS::CloudFormation::Stack.
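A minimal sketch of option 1, with illustrative resource and export names (two templates, shown together here for brevity):

```yaml
# vpc stack -- deployed once per region under the same stack name
Outputs:
  VpcId:
    Value: !Ref Vpc
    Export:
      Name: production-VpcId

# security group stack -- same region, no Mappings needed
Resources:
  WebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Web ingress
      VpcId: !ImportValue production-VpcId
```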
|
# ? Mar 28, 2022 21:24 |
|
12 rats tied together posted:I don't think that you can use any of the cfn intrinsic functions in Mappings, but it's a little hard to say if that is the exact issue here because I'm not super clear on where the Mappings key starts in your second example. I appreciate the response greatly. Option 1 is clearly what I need to do; I was severely overthinking things. If I just export the value with a non-region-specific name, then I can import it with the security group template. As you said, I'll only be executing a particular stack in a single region, so that will work fine. To your point in option 2, I'd wanted to make the security group template separate simply due to the number of security groups and ingress/egress rules contained within. It's quite bulky and would make the networking template quite large as a result. I just thought maintenance might be easier with smaller, separate templates. I'm pretty new to CloudFormation, though. Is it more common to have a larger template file than to break pieces out into their own? Thanks again for the help.
|
# ? Mar 28, 2022 21:50 |
|
If it makes the template cumbersome to read and understand, you're absolutely right to split it up like this. Security Groups are a super overloaded concept in AWS, so what I generally prefer to see is a distinction between "network" SGs and "membership" SGs.

Membership SGs are for when you have something like "the chat service", which is comprised of a bunch of other AWS crap. The chat service SG, which contains every applicable member of the chat service, lives in the chat service template just for convenience. You mostly use this SG for its members, for example, a load balancer config where you need to allow traffic to every member of the chat service.

Network SGs are for when you have something like "allow inbound traffic from the office". It's not tied to a particular service, so it doesn't have a service stack to live in; your options are basically to have a Network SG stack or to embed it somewhere that logically relates to things in AWS that have network connectivity to things not in AWS.

I usually end up deciding that the VPC stack is the best place and I throw them all in there, but I rarely have more than like 5 of these "Network SGs" so it is not especially cumbersome. If I had 50, I would absolutely put them in their own stack, and that stack would probably also be a good place for network ACLs to live if I had any.
|
# ? Mar 28, 2022 22:00 |
|
12 rats tied together posted:If it makes the template cumbersome to read and understand, you're absolutely right to split it up like this. Security Groups are a super overloaded concept in AWS so what I generally prefer to see is that you make a distinction between "network" SGs and "membership" SGs. Appreciate that suggestion. We do have quite a lot of different SGs and rules within them. Redistributing them into separate templates based on the service they apply to would definitely make more sense and help manage them.
|
# ? Mar 29, 2022 20:10 |
|
Is there any way to do multi-region with Aurora Serverless? We have a database that has very low utilization with occasional small spikes so it's perfect for serverless but I need to have it be multi-regional.
|
# ? Apr 1, 2022 15:56 |
|
Scrapez posted:Is there any way to do multi-region with Aurora Serverless? We have a database that has very low utilization with occasional small spikes so it's perfect for serverless but I need to have it be multi-regional. They claim Serverless v2 is compatible with Aurora Global Database, which is multi-region. MySQL only, and it's in preview. https://aws.amazon.com/rds/aurora/serverless/
|
# ? Apr 1, 2022 22:45 |
|
I'm trying to understand cloud pricing so I'm not such a mook. Data transfer out of us-east-1 costs $0.09 per GB. Cheaper regions cost $0.05 per GB. Backblaze charges $0.01 per GB. Both services claim eleven nines of durability. Why such a big price difference?
|
# ? Apr 3, 2022 20:11 |
|
Cheston posted:I'm trying to understand cloud pricing so I'm not such a mook. Data transfer out of us-east-1 costs $0.09 per GB. Cheaper regions cost $0.05 per GB. Backblaze charges $0.01 per GB. Both services claim eleven nines of durability. Why such a big price difference? Because they can?
|
# ? Apr 3, 2022 21:45 |
Cheston posted:I'm trying to understand cloud pricing so I'm not such a mook. Data transfer out of us-east-1 costs $0.09 per GB. Cheaper regions cost $0.05 per GB. Backblaze charges $0.01 per GB. Both services claim eleven nines of durability. Why such a big price difference? Yup what the other guy says. AWS charges a premium because they are the market leader and can. It's way more expensive than the competition, but also way more complete in terms of all the services available in AWS.
|
|
# ? Apr 3, 2022 22:15 |
|
|
|
If all you need is cheap storage, use Backblaze. Amazon charges the premium because of all the poo poo that works together with that storage.
|
# ? Apr 3, 2022 22:29 |