luminalflux
May 27, 2005



Corvettefisher posted:

Just throwing this out there but,

So what storage vendors do you all use and why?

HP because everything else is HP and that's what our VAR recommended we get.

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer
NetApp for our applications + most of our VMware. We love all the snap management products and the performance is great. It is, however, very expensive compared to...

Oracle/Sun for our VDI and dead data that we have to retain for some period. We love the price and the performance. Management is pretty simple, though it does not have the cool software that NetApp does.

Bitch Stewie
Dec 17, 2011

Corvettefisher posted:

Just throwing this out there but,

So what storage vendors do you all use and why?

HP StoreVirtual, mainly because you can do VMware-certified metro clustering with it insanely cheaply.

Aradekasta
May 20, 2007
My research group wants a NAS or file server for about 12T of existing data, ideally extensible up to ~20T, on the cheap. We already have a Synology DiskStation with some 'prosumer'-level drives for local storage (4x2T drives in RAID 5), but we've also been using a university-hosted storage/backup solution that we're finding too expensive, so we want to take the 12T we have there and move it to our own file server. What can we do that's a little more robust than just buying a bigger NAS, but won't drain our grant budget forever?

Keep in mind this group used to store its old data in a literal cardboard box of loose hard drives, so high storage costs make the boss think, "what the hell, I could just get a bunch of drives from Best Buy!" On the other hand, we have an opportunity to get some funding for lab equipment in the near future, so we'd probably consider options up to maybe $15k, though the pressure will be on to keep it under half that (which I think is the level at which you'd be worried about trusting your porn collection to that hardware, never mind years of critical research data you are in theory obligated to provide upon request).

I believe the funding is for capital equipment only, so Amazon S3 et al. are out. Ideally whatever we do can also more or less run itself, since we have no dedicated IT staff and the local 'computer people' are both leaving within 6 months.

Thanks Ants
May 21, 2004

#essereFerrari


Pretty sure the quotes I had for lower-end NetApp were around that figure about 8 months ago, before any haggling on price.

Edit: The $15k figure, not the half-that option. As mentioned above, there's nothing technically wrong with Synology units, but they don't offer any sort of SLA and I'm not sure I'd want critical data on a box that can also be an iTunes server.

Thanks Ants fucked around with this message at 04:25 on Jun 9, 2013

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer

Aradekasta posted:

we'd probably consider options up to maybe $15k, though the pressure will be on to keep it under half that (which I think is the level at which you'd be worried about trusting your porn collection to that hardware, never mind years of critical research data you are in theory obligated to provide upon request).
If you are replacing a prosumer unit, I would probably just roll my own, with a 6+2 raidz2 of 3TB drives (plus some SSD for cache) on something that runs ZFS. You can double it with a second 6+2 at some point in the future. At the prices we're talking about, you can afford to buy a second identical unit to replicate to. Since it's a university, I am guessing you could do high-frequency asynchronous replication at gigabit or higher speeds to somewhere else on campus. Performance should beat your current solution, and some NAS-oriented ZFS distributions are extremely easy to manage.
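As a rough sketch of what that shape looks like on a stock ZFS box (purely illustrative — the pool, dataset, and device names here like tank, research, da0-da16, and backup-host are placeholders for whatever your hardware and network actually present):

code:
# one 8-disk raidz2 vdev (6 data + 2 parity) plus an SSD as a read cache
zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7 cache da8

# carve out a dataset and share it over NFS
zfs create tank/research
zfs set sharenfs=on tank/research

# snapshot and ship it to a second identical box elsewhere on campus
zfs snapshot tank/research@2013-06-09
zfs send tank/research@2013-06-09 | ssh backup-host zfs receive tank/research

# later: double capacity by adding a second 6+2 vdev to the same pool
zpool add tank raidz2 da9 da10 da11 da12 da13 da14 da15 da16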

parid
Mar 18, 2004
Dell is pretty much giving away MDs now; might be worth checking into.

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
What is it with faculty and storage? It's like they're all morons about it.

parid
Mar 18, 2004
I think they believe they are able to make the best choice for themselves for anything. The information they have is how much it costs them to go to Costco and buy an external 2TB drive, and anything bigger should be proportionally more expensive, right? Since they have a superior ability to think critically, why listen to anyone else? Don't understand the decision? Who cares, take your ball and go home.

It's not like working together for a common good has ever helped anyone.

Aradekasta
May 20, 2007
Thanks for the ideas, guys. In this case the cost of the storage we're using through the school really is excessive as a proportion of our lab's operating costs, which is a bad thing given how squeezed science budgets are right now. It's just that the prof has no sense of what's available between the drive you get from Best Buy and the $3k/T/year managed solution.

There are also cash flow issues due to the grant system. With the sequester, some grants have been funded but not disbursed for months after the original target date. Purchases of large shared equipment usually go through a different funding mechanism than normal lab expenses. The combined incentive is for everyone to buy their own cheap-rear end hardware rather than sharing or using a third-party service.

adorai posted:

if you are replacing a prosumer unit, I would probably just roll my own

When we bought the diskstation I thought 'next time I'm just going to build one myself', but now that it's next time, I'm reconsidering. I'm leaving in the next 3-6 months and my 'replacement' on the research side just met Linux a month ago. It's possible the funding won't even come through till after I'm gone, and he's definitely not ready to get a pile of parts on his desk and turn them into a functioning file server. Even if it comes earlier, I don't want to spend valuable thesis-writing cat-video-watching time on setting up something that nobody has the ability to maintain, because they'll just end up with a bunch of external drives anyway.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Aradekasta posted:

$3k/T/year managed solution
Holy loving poo poo, we charge less than 1/3 of this, don't bill it yearly, and that includes replication to our DR site. Is this all 15K tier-1 storage?

Aradekasta
May 20, 2007
:sigh: I know. It's their top-tier storage - snapshots, offsite tape backups, etc. That's the price for our allocation on an Isilon system. Their lower-tier storage is much cheaper - $700/T/year, IIRC - but that's not backed up, and also not accessible from the compute nodes of the cluster, which is annoying. Currently we're split between these two, in addition to our local NAS.

The pricing is a little bit silly, because it's all ultimately just the university billing itself, so above-market rates for things like storage derive from the equally fictitious university-defined price for the physical data center space. This place is a total bureaucratic black hole and everything might as well be Monopoly money until, with luck and possibly a ritual sacrifice, it emerges from the death grip of some cranky accountant to actually pay a vendor.

In other words, if faculty are morons about storage, it's because everything in their environment has trained them to be.

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
God, my university just gives space away on its Isilon. My department also provides its own storage, but because of grant fuckery the only way we can sell it to faculty is to have them buy individual drives for our Compellent boxes, and then the department as a whole just eats the cost of enclosures and parity disks.

YOLOsubmarine
Oct 19, 2004

When asked which Pokemon he evolved into, Kamara pauses.

"Motherfucking, what's that big dragon shit? That orange motherfucker. Charizard."

FISHMANPET posted:

but because of grant fuckery the only way we can sell it to faculty is to have them buy individual drives for our Compellent boxes

This is so dumb I can't even form the words to describe it. Only some sort of hyper-dimensional chart could do it.

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
Welcome to higher ed, where everything's made up and the points don't matter.

evil_bunnY
Apr 2, 2003

NippleFloss posted:

This is so dumb I can't even form the words to describe it. Only some sort of hyper-dimensional chart could do it.
It's also not true. If your scientists write their grant apps the way they should and your staff isn't straight retarded, this isn't a problem.

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
I think it might actually have to do with how the university itself is set up. It's really hard for one account to bill another account for something; we have to set up special accounts to do it, and they require special justification from the CFO of the university or something. Also we're all pants-on-head retarded.

Thankfully that's levels above me so I just work within the constraints I'm given and call it a day.

Aradekasta
May 20, 2007

FISHMANPET posted:

grant fuckery

FISHMANPET posted:

we're all pants on head retarded.

Sounds about right.

parid
Mar 18, 2004
Yup, that's higher ed. They are specifically structured so that each school/department has total freedom. The system is set up to punish people for doing the right thing. Most of the IT money is split up and divided out to each school, which gets to make its own decisions with it. Without charging itself, there would be no way to fund large central projects.

Your central rates are very reasonable. Have you sat down with your contact there and explained your budget issue? They might be able to cut you a break to "help you do the right thing"?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

parid posted:

Your central rates are very reasonable.
Try explaining this to the 25 different research groups that are already using Dropbox for Business while IT has absolutely no clue.

evil_bunnY
Apr 2, 2003

Misogynist posted:

It gives you some redundancy from all of the most common physical damages to a datacenter -- flood, fire, electrical --
This is the part I'm uneasy with BTW. In my experience all of those things tend to come in neighborhood-wide instances, and when your main site is burning, your DR might be getting flooded with FD water. Putting stuff a little ways away (not even across town, just far enough that you're elevated and on a different subgrid) buys you a lot of safety while allowing you to keep the advantages of closely located sites.

parid
Mar 18, 2004

Misogynist posted:

Try explaining this to the 25 different research groups that are already using Dropbox for Business while IT has absolutely no clue.

That is my job 8-5. It's not easy. There are still people willing to look past that and try to work together. That's where I put most of my time.

Demonachizer
Aug 7, 2004
Just as an addendum to my previous post about synchronous replication: I set up a test environment on the same switch in the same room with two EqualLogic PS4000s and ended up with the following results from Iometer:

code:
non-sync
READ        110 MB/s    avg response  18 ms    max   52 ms
WRITE SEQ   105 MB/s    avg response  19 ms    max   45 ms
WRITE RAND   95 MB/s    avg response  20 ms    max   55 ms

sync
READ        110 MB/s    avg response  18 ms    max   52 ms
WRITE SEQ    65 MB/s    avg response  29 ms    max   78 ms
WRITE RAND    9 MB/s    avg response 300 ms    max 1053 ms
I am going to check if there is something glaringly wrong with my configuration but the drop in random writes is pretty terrible. I will move to asynchronous probably tomorrow if I can't figure out any remedy.

evil_bunnY
Apr 2, 2003

You have to wait for both devices to acknowledge, so of course it'll kill your writes. Increasing queue depth might help, but it's all synthetic anyway.
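To put rough, purely illustrative numbers on it (not measured from your setup): at queue depth 1, each write has to complete before the next one starts, so throughput is just 1/latency. A local write that acks in about 1 ms gets you on the order of 1,000 writes/s; add roughly 9 ms for the hop to the partner array and its ack and you're down to about 100 writes/s, a 10x drop, with nothing actually broken. Deeper queues hide that per-IO latency by keeping more writes in flight, which is why the outstanding-IO setting in Iometer can swing these synthetic numbers so much.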

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Does your remote target perform better when you random-write to it directly?

evil_bunnY
Apr 2, 2003

Misogynist posted:

Does your remote target perform better when you random-write to it directly?
That's a good question to ask. Also: why was the max latency a full second?

Demonachizer
Aug 7, 2004

evil_bunnY posted:

That's a good question to ask. Also why the max latency was a full second.

Boss man saw the results over my shoulder so I will never know :( I am moving to rep sets as of 20 minutes ago. I might delay a bit just so I can test some things out, since it is not quite in production yet.

YOLOsubmarine
Oct 19, 2004

When asked which Pokemon he evolved into, Kamara pauses.

"Motherfucking, what's that big dragon shit? That orange motherfucker. Charizard."

demonachizer posted:

Boss man saw the results over my shoulder so I will never know :( I am moving to rep sets as of 20 minutes ago. I might just delay a bit just so I can test some things out since it is not quite in production yet.

What size blocks are you using for testing?

Demonachizer
Aug 7, 2004

NippleFloss posted:

What size blocks are you using for testing?

32KB

Just checked on the second SAN and I am getting zero throughput; it just sits at 100% CPU utilization and throws nothing but errors. I am working on figuring out why, since there is no difference in the setup or the volume config as far as I can tell.

Got it all figured out. Both SANs behave the same on their own, but I get the same horrible results when doing the sync rep.


EDIT2: Just got word that it is possibly due to the fact that we are using a Cisco 2248TP, which is a fabric extender, so all traffic goes to the Nexus and back to the FEX module instead of staying local.

It sounds like the recommendations we got for network devices may have been a little iffy.

Demonachizer fucked around with this message at 20:36 on Jun 12, 2013

evil_bunnY
Apr 2, 2003

demonachizer posted:

EDIT2: Just got word that it is possibly due to the fact that we are using a Cisco 2248TP, which is a fabric extender, so all traffic goes to the Nexus and back to the FEX module instead of staying local.
This is true, but unless your Nexus pegs its CPU during the random test it's not the root cause. The fact that it does fine with sequential r/w points that way too.

Demonachizer
Aug 7, 2004

evil_bunnY posted:

This is true, but unless your Nexus pegs its CPU during the random test it's not the root cause. The fact that it does fine with sequential r/w points that way too.

Yeah, I found a white paper on that specific FEX module and best practices for integration with the PS6000 (ours is a PS4000, but I think a lot of the considerations should be the same) that I am going to have the network guy look at. He is pretty drat knowledgeable, especially with Cisco technology, so I think we might come up with something. We are comfortable switching away from SyncRep (the EqualLogic nomenclature for their synchronous replication) as it wasn't part of our initial design spec anyway. It just seemed like we might have everything in place for it, so why the heck not.

YOLOsubmarine
Oct 19, 2004

When asked which Pokemon he evolved into, Kamara pauses.

"Motherfucking, what's that big dragon shit? That orange motherfucker. Charizard."

Like evil_bunnY said, I'm not really sure how it could end up being a switching issue, given that it is only affecting random workloads, which are going to be lower bandwidth and stress the switches much less. If your switching infrastructure were introducing serious delays, I'd expect to see that affect higher-bandwidth sequential write loads just as much as the random ones. I'd expect to see some serious packet loss or congestion pushing TCP windows way, way down if the network were the cause, and that should be immediately apparent looking at the port stats on the controllers and switches.

Demonachizer
Aug 7, 2004

NippleFloss posted:

Like evil_bunnY said, I'm not really sure how it could end up being a switching issue, given that it is only affecting random workloads, which are going to be lower bandwidth and stress the switches much less. If your switching infrastructure were introducing serious delays, I'd expect to see that affect higher-bandwidth sequential write loads just as much as the random ones. I'd expect to see some serious packet loss or congestion pushing TCP windows way, way down if the network were the cause, and that should be immediately apparent looking at the port stats on the controllers and switches.

Any idea what we should be looking at then? When I do random writes to each SAN individually everything seems fine.

YOLOsubmarine
Oct 19, 2004

When asked which Pokemon he evolved into, Kamara pauses.

"Motherfucking, what's that big dragon shit? That orange motherfucker. Charizard."

demonachizer posted:

Any idea what we should be looking at then? When I do random writes to each SAN individually everything seems fine.

Without knowing how SyncRep works under the covers it's hard to say what could be causing it, but it sounds like a protocol problem rather than a configuration problem. Have you looked at the port stats on the controllers (not sure what is available on EQL, but stuff like errors, re-transmits, send-q, window sizes and pause frames would be what I would look at) and switches to see if things look good on the network side? Your network admin could do a packet trace from the switch and help you trace how long it is taking for your SyncAlternate controller to ack an IO.

I know we have some EQL users in this thread, so maybe they can shed some more light on how SyncRep actually handles writing to the secondary.
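In the meantime, on the switch side, here's a rough sketch of the sort of things I'd pull up on the Nexus (interface numbers are placeholders, and exactly which counters are exposed varies by NX-OS version):

code:
show interface ethernet 1/1                   ! errors, drops, pause frame counters
show interface ethernet 1/1 counters errors   ! CRC/runts/discards in one table
show interface flowcontrol                    ! is 802.3x pause actually negotiated?
show queuing interface ethernet 1/1           ! per-queue drops if QoS is in play
show interface fex-fabric                     ! state of the 2248TP uplinks to the parent switch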

YOLOsubmarine fucked around with this message at 02:07 on Jun 13, 2013

Demonachizer
Aug 7, 2004

NippleFloss posted:

Without knowing how SyncRep works under the covers it's hard to say what could be causing it, but it sounds like a protocol problem rather than a configuration problem. Have you looked at the port stats on the controllers (not sure what is available on EQL, but stuff like errors, re-transmits, send-q, window sizes and pause frames would be what I would look at) and switches to see if things look good on the network side? Your network admin could do a packet trace from the switch and help you trace how long it is taking for your SyncAlternate controller to ack an IO.

I know we have some EQL users in this thread, so maybe they can shed some more light on how SyncRep actually handles writing to the secondary.

Talked to a dude on Dell forums and he suggested investigating QoS on the switch side to make sure jumbo frames are lossless since that could cause the behavior as well. I will meet with the network dude in a couple days and work him hard. I guess poo poo like this is why my job shines in some ways. It is very low pressure. In another environment I would be drinking myself to sleep by now but here I can kind of just figure the poo poo out then make it live when I am ready.

I am also looking at logging methods etc. for the SANs in parallel to make sure all bases are covered and whether I can glean anything from it.

Aradekasta
May 20, 2007
The plan's changed a bit, since the new funding won't come through till next year. We need the cheapest NAS possible that will hold onto ~15T of data for six months without crapping out. I'm not in love with the Synology software, but the diskstation we have works well enough. Any obvious reason not to just pick up a bigger one?

parid posted:

Your central rates are very reasonable. Have you sat down with your contact there and explained your budget issue? They might be able to cut you a break to "help you do the right thing"?
I don't know about reasonable, but given all the other poo poo we need to buy/replace soon, we just can't afford those rates anymore. We've already gotten a lot of breaks - their prices for compute time are actually very good.

YOLOsubmarine
Oct 19, 2004

When asked which Pokemon he evolved into, Kamara pauses.

"Motherfucking, what's that big dragon shit? That orange motherfucker. Charizard."

demonachizer posted:

Talked to a dude on Dell forums and he suggested investigating QoS on the switch side to make sure jumbo frames are lossless since that could cause the behavior as well. I will meet with the network dude in a couple days and work him hard. I guess poo poo like this is why my job shines in some ways. It is very low pressure. In another environment I would be drinking myself to sleep by now but here I can kind of just figure the poo poo out then make it live when I am ready.

I am also looking at logging methods etc. for the SANs in parallel to make sure all bases are covered and whether I can glean anything from it.

Have you attempted to run a separate benchmark against another LUN on the primary or secondary while also running it against a SyncRep LUN? If you get controller wide performance problems whenever SyncRep is turned on that should pretty much rule out network as an issue (at least assuming you have dedicated links for replication traffic and client traffic).

Demonachizer
Aug 7, 2004

NippleFloss posted:

Have you attempted to run a separate benchmark against another LUN on the primary or secondary while also running it against a SyncRep LUN? If you get controller wide performance problems whenever SyncRep is turned on that should pretty much rule out network as an issue (at least assuming you have dedicated links for replication traffic and client traffic).

I just ran a test with all three workers going at the same time:

Worker 1 -> 100% random to SAN A SyncRep Enabled Volume 5-10MB/S
Worker 2 -> 100% random to SAN A Non SyncRep Volume 50-60MB/S
Worker 3 -> 100% random to SAN B Non SyncRep Volume 50-60MB/S

I think that a decrease in performance is probably expected when writing to multiple volumes at the same time, so I am not sure that the 50% decrease on Workers 2 and 3 is a problem.

I am not sure that I am able to isolate replication traffic from other SAN traffic on the EqualLogic PS4000. I can't think of a way to tag the traffic at the switch level, but I am not a network guy. I will ask him if we can define routes based on source and target IP addresses for testing.

evil_bunnY
Apr 2, 2003

You need to test on different disk groups.

Demonachizer
Aug 7, 2004

evil_bunnY posted:

You need to test on different disk groups.

Not sure I understand what that means in terms of the PS4000. I can set up volumes but I am not sure I can set specific disks to service the volume.

The organization is by group, which is a collection of SANs; you define storage pools within the group and then volumes within the storage pools. Do you mean that I need a new group? I am only able to do Synchronous Replication within a group. I have the pools defined as the entirety of each SAN: one is Primary, one is Rep. In the primary pool I have two test volumes: one is set to replicate, one is not. In the Rep pool I have two volumes: one is a standalone, non-replicated volume and the other is the replication target volume, which is created automatically.

Demonachizer fucked around with this message at 17:57 on Jun 13, 2013
