Mr Shiny Pants
Nov 12, 2012

BabyFur Denny posted:

Learn MapReduce, since the concepts are still everywhere in modern frameworks even if the MapReduce framework itself is outdated. Learn when you need to redistribute your data by key (i.e. when data is shuffled around) and what can be done locally vs. distributed. Read Martin Kleppmann's Designing Data-Intensive Applications. Figure out what happens if any individual component in your architecture fails. Avoid Spark.

Also feel free to ask here. I have plenty of experience with running those three frameworks.

Thanks man, what's wrong with Spark? I read that Flink is supposed to be its replacement, but there seem to be some nice .NET libraries for Spark.


Fluue
Jan 2, 2008
Got a question about DynamoDB / key-value denormalization and managing deletes/updates.

I'm trying to implement an RBAC-esque system using AWS Lambda and DynamoDB as the storage backend. There are a few intermediate services that also store data, but are unimportant here.

I've denormalized the User -> has many Roles -> (Roles) have many Permissions relationship into this:



Where the via_role is an array of roles that the user received a specific permission from.

The problem lies with managing role removals. If a role is revoked from a user, I need to either delete the permission record (if the user only has that permission from the role being revoked) or just remove the role from the permission record (if that permission was also granted by another role). I need to do this for each permission the revoked role has. This also applies if the role is deleted from the system; I would need to go through every user that has that role and perform the same operation.

Am I on the right track here with the key-value denormalization of this hierarchical, almost-graph-database design?

abelwingnut
Dec 23, 2002


got another project to ask about, and i have more to ask about the previous one with kafka and scylla--thank you for those responses. just haven't had time, and that one's more of a side thing than anything. but this issue's a bit more pressing.

in any case, i'm still new to nosql, so again, forgive me if this is basic.

we took over this project, and i need to choose between couchbase and redis for, essentially, a cached 'lookup' table. i say 'lookup' table because what we want is a bit more complex than key-value.

right now, we have three tables in a sql database in snowflake. an api calls to them when passed certain information. each table contains hundreds of millions of rows. this lookup is quite slow, and we need to make it fast.

each table has the same columns: hash_type, hash, device, type, timestamp

hash_type varchar(16777216) : takes one of three different options--a, b, or c. as they originally structured it, each table represents a type. so table 1 is all of hash_type 'a', table 2 is all of hash_type 'b', and table 3 is all of hash_type 'c'. and yes, table 1 has the column hash_type, and every field is 'a'. same for all other tables. yes, it is a waste and poorly designed.

hash varchar(16777216): a long string of characters that represents a particular user. this is not unique.

device varchar(16777216): another long string of characters that represents a device used for that particular user. there can be many devices related to one hash. i need to query to find out if a device can be linked to multiple users.

type varchar(16777216): takes one of three options (x, y, or ''). this is the operating system of the device.

timestamp number(38,0): a linux timestamp

alright, so as a sql guy, i see problems galore. but...we don't want sql, and i don't know nosql well enough. so i'm unsure how to tackle this.

my first thought was to look at the api code. it seems there are two forms of calls:

--

form 1:

const router = require('express-promise-router')();

// GET /get_maids_by_md5?md5=hash
router.get('/get_maids_by_md5', async (req, res) => {
  const { statement, rows } = await req.app.get('snowflake').query({
    sqlText:
      "SELECT device AS maid, type, timestamp" +
      " FROM table1" +
      " WHERE hash = ? AND hash_type = 'a'",
    binds: [ req.query.md5 ]
  });

  res.json({
    error: 0,
    response: { maids: rows }
  });
});

--

alter the table name and hash_type for the other two tables.

and form 2:

--

// GET /get_hems_by_x?sha1=hash
router.get('/get_hems_by_x', async (req, res) => {
  const { statement, rows } = await req.app.get('snowflake').query({
    sqlText:
      "SELECT hash, hash_type, timestamp" +
      " FROM a" +
      " WHERE device = ? AND type = 'x'" +
      " UNION SELECT hash, hash_type, timestamp" +
      " FROM b" +
      " WHERE device = ? AND type = 'x'" +
      " UNION SELECT hash, hash_type, timestamp" +
      " FROM c" +
      " WHERE device = ? AND type = 'x'",
    binds: [ req.query.x, req.query.x, req.query.x ]
  });

  res.json({
    error: 0,
    response: {
      hems: rows
    }
  });
});

--

alter the type for the other query of this form.

so 5 queries of two different forms. again, these aren't ideal whatsoever from a sql standpoint. i'm also confused by the api code itself. i'm not familiar with it, so i can't tell if that code is actually passing in anything to the parameterized WHERE predicates, or if those are just needlessly there. like is the query in form 1 basically creating the lookup table of all hashes, or is it passing in a hash and returning just the devices associated with the passed in hash? i'm not so good outside of sql! and similarly, is the second form creating a lookup table of hashes, or is it passing in a particular device and getting back the hashes associated?

in any case, this is what i'm working with.

so back to the question: do i want to translate this to redis or couchbase?

based on what research i've done, i'm not even sure i could put this in redis given those api calls. i could make the hash the key, and then use the redis hash feature for the devices and use that cool 'indexing' trick i found. but i'm not sure how that would incorporate the date, type, or hash_type as well. and given we want the client to be able to query off of those fields, those are necessary.
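just to sketch what i mean--key names here are totally made up and i'm only using the ioredis package as an example, so treat this as a guess rather than a plan:

code:
const Redis = require('ioredis');
const redis = new Redis();

// write path: one hash per (user hash, device) pair, plus two reverse sets
async function storeRow(hashType, hash, device, type, timestamp) {
  await redis.hset(`user:${hash}:${device}`, 'hash_type', hashType, 'type', type, 'timestamp', timestamp);
  await redis.sadd(`user:${hash}:devices`, device);   // form 1: hash -> devices
  await redis.sadd(`device:${device}:hashes`, hash);  // form 2: device -> hashes
}

// read path for form 1: all devices (with their attributes) for a given hash
async function devicesForHash(hash) {
  const devices = await redis.smembers(`user:${hash}:devices`);
  return Promise.all(devices.map(async (d) => ({
    maid: d,
    ...(await redis.hgetall(`user:${hash}:${d}`)),
  })));
}
the reverse set per device is basically that 'indexing' trick--keeping a second key around so the device -> hashes lookup never has to scan anything.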

with couchbase, i could see making each hash a document, thereby making hash the key. then from there, i could create the necessary sub-objects (not really sure on the correct term--i need to research this more. arrays maybe?). in any case, then i could go wild with indices to satisfy the queries. indices in couchbase, from what i understand, are stored in memory, and are superfast. but this doesn't strike me as the best strategy, as i'll just be doubling the data in memory and being wasteful.
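and for the couchbase side, the per-hash document i'm picturing is something like this (field names are just guesses):

code:
// one document per user hash, keyed by the hash itself (made-up field names)
const doc = {
  hash_type: 'a',
  devices: [
    { maid: 'device-123', type: 'x', timestamp: 1553000000 },
    { maid: 'device-456', type: 'y', timestamp: 1553000001 },
  ],
};
queries by device would then need a secondary index over the device ids inside that array, which is exactly where the doubling-data-in-memory worry comes from.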

in any case, those are my thoughts as of now. not sure if there's a better option out there that might help with this, but just a) writing it out to think it through myself and b) looking for a little guidance one way or the other. this doesn't seem like the most outlandish request, so i'm sure there's a model or example or feature that can help me approach this. i think i mostly posted this as a cry for help in trying to break my sql-centric way of thinking. i'm sure there's a simple, fast solution to this.

sorry for the wall of words! and thank you for your time to the brave souls that made it through this.

abelwingnut fucked around with this message at 12:22 on Mar 19, 2019

skimothy milkerson
Nov 19, 2006

source 👏🏼 your 👏🏼 are 👏🏼 quotes 👏🏼

abelwingnut
Dec 23, 2002


Skim Milk posted:

source 👏🏼 your 👏🏼 are 👏🏼 quotes 👏🏼

huh? seriously not sure what you mean. or maybe this is in reference to another post?

PIZZA.BAT
Nov 12, 2016


:cheers:


Abel Wingnut posted:

with couchbase, i could see making each hash a document, thereby making hash the key. then from there, i could create the necessary sub-objects (not really sure on the correct term--i need to research this more. arrays maybe?). in any case, then i could go wild with indices to satisfy the queries. indices in couchbase, from what i understand, are stored in memory, and are superfast. but this doesn't strike me as the best strategy, as i'll just be doubling the data in memory and being wasteful.

I don't have experience with couchbase but I DO have experience with document-based DBs so I can maybe help with your data modeling a bit. Generally if you're using a document-store you want your use case to drive your model. What are the business requirements driving this? If the queries going into the other database are producing a single 'entity' to be consumed then that's generally what you'll want your document to look like.

Secondly, on assigning keys. I'm assuming this is similar to a range index in MarkLogic where the keys are stored in memory for fast retrieval. My general rule of thumb is to store primary keys in memory if you can afford it, due to the dramatic performance gain you get from it. Typically your keys are going to be a few dozen bytes tops, and when multiplied across a few million documents that gives you what... a few hundred megs? This is assuming a naive storage method as well and not something like a spanning tree. Very small price to pay considering the increased performance you get.

I'd recommend benchmarking the database and enabling the index just to see how much memory it takes. It can't hurt to try. Resources are meant to be spent!

abelwingnut
Dec 23, 2002


thanks for the response.

i found out that the one type of api call passes in a hash, and the other passes in a device_id. therefore those queries aren't calling all rows, thank god. so i've basically decided to go with couchbase, with the hash most likely the primary key, and then creating a secondary index for the device_id as well. i think that'll give the speed we want.

in other news, and relating back to my first post with kafka and scylla.

as i mentioned before, i come from the land of sql, so normalizing is what i know and trust. but things are obviously different in nosql land.

so for this project i'm working on, i have approximately 12 dimension tables. they store information for users who log on to the web site powered by this database, campaigns that they're running, some lookup tables, agencies these users work for, and projects under these campaigns

naturally, this is all very structured, and lends itself to the world of sql. but i'll hopefully be logging billions of impressions connected to multiple campaigns and projects, and that certainly won't play well with sql. so my first thought was to put the dimension tables in a mysql database and simply store all the impressions in a scylla keyspace, and then use hive or spark or some other intermediary to join tables across the databases.

but this seems like it could be really slow, no? and beyond that, i watched some videos and read various sites about nosql theory and data modeling, and i feel the aforementioned approach may not be ideal. maybe what i want to do is somehow flatten all of the relevant data into one giant table in scylla

does that sound more like the right way to employ nosql in this instance? querying one giant table would obviously be easier than querying multiple tables across two databases. i certainly have to imagine it would be much, much faster. but wouldn't this end up taking a ton of space? there'd be tons of redundant data. it just feels to me and my sql-molded mind like all of that is meant for a lookup table.

and what if i needed to change, say, a campaign name? obviously in sql, i'd update a single field. but in one giant table i'd have to change it in potentially billions of rows. wouldn't that be a massive burden for the server? or is that ok in nosql world?

PIZZA.BAT
Nov 12, 2016


:cheers:


Again- I don't know about the specific technologies you're using but most document stores don't naively store documents as-is. They should have some sort of compression occurring, so if you have a lot of redundant data you will only really need to worry about it when it's in memory. Also, when I see clients starting to worry about 'lots of redundant data', I find when I drill down into what they're doing that it's because they're still using old SQL habits. For example- say you have a primary key that's used as an identifier across dozens or hundreds of different tables. A lot of times I'll see clients storing that identifier all throughout the document as they feel they need to keep it in every location it showed up in the original tables. Why? You only need to have it stored once at the root and that's it.

Also keep in mind that if you have repeating elements you don't have to dedicate a field to describing what the particular element is. This is necessary in the SQL world but there's a more elegant way of storing it in a document. Say for example addresses. Your SQL table could look something like this:
 
USER_PK ADDRESS_TYPE ADDRESS
61872   HOME         '123 FAKE STREET'
61872   WORK         '456 EXAMPLE LN'


The naive approach to turning this into a document would look something like this

{
  'Person': {
    'ID': '61872',
    'ADDRESSES':[
    {
      'USER_PK':'61872',
      'ADDRESS_TYPE':'HOME',
      'ADDRESS':'123 FAKE STREET'
    },
    {
      'USER_PK':'61872',
      'ADDRESS_TYPE':'WORK',
      'ADDRESS':'456 EXAMPLE LN'
    }
    ]
  }
}


However all you're really doing here is storing a relational table in a big document. Don't forget that your element names are allowed to be descriptive here!

{
  'Person': {
    'ID': '61872',
    'HOME_ADDRESS':'123 FAKE STREET',
    'WORK_ADDRESS':'456 EXAMPLE LN'
  }
}


It's a very simple example, so hopefully I'm getting my point across. You'll find that you'll be able to make your documents much more information-dense if you take some time to sit down and think about how best to represent the data as a document, rather than copying over a bunch of relational tables. As a rule of thumb- just ask yourself how you would want the data to look if it were printed on an actual piece of paper for a human to read and use. That's usually the best direction.

abelwingnut
Dec 23, 2002


thanks, again

the second project i mentioned will use scylla for its database, which is a wide column store, not a document store. it's basically cassandra rewritten in c++.

in any case, what you wrote makes sense to me for document stores. i imagine i can adopt some of those same practices for what i'm trying to do with the scylla project. the way we first approached this idea was treating scylla as basically a transaction log that can be queried off of. in that light, repeated data would be something like:


id | campaign              | datetime | hash
01 | campaignWithLongNameA | 20190322 | wxyz
02 | campaignWithLongNameA | 20190322 | abcd
03 | campaignWithLongNameA | 20190322 | hijk
04 | campaignWithLongNameB | 20190322 | lmno


so yea, i think my thing was this: in sql land, i could just use the integer ids of campaignWithLongNameA and campaignWithLongNameB instead, which will take up less space. and then, when i need to query i could just join the right lookup table. but denormalizing it and creating this massive table will require the actual names in the rows, and that's obviously going to take up more space

but i guess i just need to figure out the right model that prevents so much redundancy.

abelwingnut fucked around with this message at 23:33 on Mar 23, 2019

Razzled
Feb 3, 2011

MY HARLEY IS COOL
One thing to consider is that the data model for Cassandra is dependent on the queries. It's not a great choice for ad-hoc querying but it IS a great choice for high-ingest workloads or purpose-built lookup reads. In fact, because writes are so cheap in Cassandra it's better to ingest data as denormalized with little regard for duplication (since other methods of joining data are likely to be more expensive than reading a bigger row)

You should be building tables based on what questions your application needs answered and the queries should not deviate from those specs.

Arcsech
Aug 5, 2008

Razzled posted:

One thing to consider is that the data model for Cassandra is dependent on the queries. It's not a great choice for ad-hoc querying but it IS a great choice for high-ingest workloads or purpose-built lookup reads. In fact, because writes are so cheap in Cassandra it's better to ingest data as denormalized with little regard for duplication (since other methods of joining data are likely to be more expensive than reading a bigger row)

You should be building tables based on what questions your application needs answered and the queries should not deviate from those specs.

This is absolutely true.

Cassandra is great IF:
  1. You need to dump a bunch of data into it really, really fast and don't care how long querying it takes, OR
  2. You need very fast queries, but have a relatively short list of the very specific kinds of queries that will ever be run, and don't care how long querying it takes outside of that list.

Because if you're doing anything other than effectively
code:
SELECT * FROM my_table WHERE partition_key IN <list> AND clustering_key IN <range>
You're doing a full table scan. (Note: last time I touched C*, SASI indexes weren't really a thing yet. This may have changed the situation somewhat)

e: The way it was introduced to me, which isn't a bad way of thinking about it if you're new to C*, is: If you would use an index in an RDBMS, you create a new table in C* and denormalize your data into the new table.
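To make that concrete, here's a rough sketch of what I mean, borrowing the impressions/campaigns example from earlier in the thread. Table and column names are made up and I'm only using the Node cassandra-driver for illustration:

code:
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'example',
});

// where an RDBMS would add an index on campaign, C* gets a second table keyed by campaign
// (run once at startup: await client.execute(ddl);)
const ddl = `
  CREATE TABLE IF NOT EXISTS impressions_by_campaign (
    campaign text,
    ts timestamp,
    hash text,
    PRIMARY KEY ((campaign), ts, hash)
  )`;

// every impression gets written to both the original table and this one;
// reads that ask "give me impressions for campaign X" only ever touch this table
async function recordImpression(campaign, ts, hash) {
  await client.execute(
    'INSERT INTO impressions_by_campaign (campaign, ts, hash) VALUES (?, ?, ?)',
    [campaign, ts, hash],
    { prepare: true }
  );
}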

Arcsech fucked around with this message at 20:55 on Mar 25, 2019

skimothy milkerson
Nov 19, 2006

Abel Wingnut posted:

huh? seriously not sure what you mean. or maybe this is in reference to another post?

oh jeeze im so sorry. i thought this was YOSPOS. i was phone posting. my mistake

PIZZA.BAT
Nov 12, 2016


:cheers:


Skim Milk posted:

oh jeeze im so sorry. i thought this was YOSPOS. i was phone posting. my mistake

no silly i linked it from yospos

skimothy milkerson
Nov 19, 2006

Rex-Goliath posted:

no silly i linked it from yospos

turn your monitor on

limaCAT
Dec 22, 2007

il pistone e male
Slippery Tilde
I was starting this post by saying that I was ready to throw in the towel on understanding couchdb views.
However I managed to create some views that return the data I actually want. Yay! :neckbeard: :dance: :toot:

Anyway:

Mr Shiny Pants
Nov 12, 2012
So I got Kafka working and I am really digging working with it. I've built my own producer and consumer in F# and it works swimmingly.
I am just sending raw stuff over right now and I've been reading up on serialization. Is Avro any good?
I really like the idea of a message schema and having a central registry handling it all; otherwise I'll need to create an F#-specific domain model for the messages that are sent over.

BabyFur Denny
Mar 18, 2003
Both Avro and protobuf are good choices. Pick one and stick with it.
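The schema itself is just JSON no matter what language you produce from, so the F# side only needs a serializer that understands it. Here's a minimal sketch of an Avro schema and a round-trip; I'm using the avsc Node package purely because it's short, and the record/field names are made up:

code:
const avro = require('avsc');

// made-up record: whatever your message actually carries
const eventType = avro.Type.forSchema({
  type: 'record',
  name: 'Event',
  fields: [
    { name: 'id', type: 'string' },
    { name: 'timestamp', type: 'long' },
    { name: 'payload', type: 'string' },
  ],
});

// encode/decode round-trip; with a schema registry a schema id travels with each message
const buf = eventType.toBuffer({ id: 'abc', timestamp: 1554000000, payload: 'hello' });
const evt = eventType.fromBuffer(buf);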

30.5 Days
Nov 19, 2006
Hey, I have a system that needs to pull up users by display name. Display names are normally a jumble of nonsense. I'd like the search to have fuzzy search and prefix search- my ideal experience is visual studio's intellisense, where it can see what you're trying to type halfway through the symbol name, even if you've made a mistake. Elasticsearch hasn't been super helpful, because fuzzy search in ES is 100% distance based- so it kind of needs you to type in the entire symbol before it can find anything. I've used a query that has a weighting between fuzzy & wildcard search- so it can find things that are complete and incorrect, or incomplete and correct, but it can't find things that are incomplete and incorrect.

Is there anyone who understands elasticsearch better and knows of a better solution? Is there another search implementation that would work better, or is this something where I need to build my own thing to make this work how I want?

PIZZA.BAT
Nov 12, 2016


:cheers:


Can you give more info on what you mean by ‘jumble of nonsense’?

Arcsech
Aug 5, 2008
Some examples would be useful, and there’s a way that might be what you’re looking for but it’s generally not recommended because it uses up considerably more memory and disk.

30.5 Days
Nov 19, 2006

Rex-Goliath posted:

Can you give more info on what you mean by ‘jumble of nonsense’?

Stuff like ThisIsAVeryLongUserName12341

The sort of thing that elasticsearch's defaults, which are based around a corpus of english text, do terribly with. I'd like someone who's typed in "thisisavrye" to get that back as a result.

Arcsech
Aug 5, 2008

30.5 Days posted:

Stuff like ThisIsAVeryLongUserName12341

The sort of thing that elasticsearch's defaults, which are based around a corpus of english text, do terribly with. I'd like someone who's typed in "thisisavrye" to get that back as a result.

I’m mobile at the moment so I can’t go too in-depth but look into the NGram tokenizer, it can be easy to shoot yourself in the foot with but if you’re only using it on short fields (like usernames) it should get you closer to what you want.
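Roughly this is what I have in mind; index and field names are made up and the gram sizes will need tuning, so treat it as a starting point (using the @elastic/elasticsearch client):

code:
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function createUserIndex() {
  await client.indices.create({
    index: 'users',
    body: {
      settings: {
        analysis: {
          tokenizer: {
            username_ngrams: { type: 'ngram', min_gram: 3, max_gram: 4 },
          },
          analyzer: {
            username_analyzer: {
              type: 'custom',
              tokenizer: 'username_ngrams',
              filter: ['lowercase'],
            },
          },
        },
      },
      mappings: {
        properties: {
          // a plain match query against this field then behaves like fuzzy-ish prefix search
          display_name: { type: 'text', analyzer: 'username_analyzer' },
        },
      },
    },
  });
}
The memory/disk cost I mentioned comes from every username being exploded into a pile of 3-4 character grams, which is why you only want this on short fields.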

Cancelbot
Nov 22, 2006

Canceling spam since 1928

Fluue posted:

Got a question about DynamoDB / key-value denormalization and managing deletes/updates.

I'm trying to implement an RBAC-esque system using AWS Lambda and DynamoDB as the storage backend. There are a few intermediate services that also store data, but are unimportant here.

I've denormalized the User -> has many Roles -> (Roles) have many Permissions relationship into this:



Where the via_role is an array of roles that the user received a specific permission from.

The problem lies with managing role removals. If a role is revoked from a role I need to either delete the permission record if the user only has that permission from the role being revoked, or just remove the role from the permission record is that permission was also granted by another role. I need to do this for each permission the revoked role has. This also applies if the role is deleted from the system; I would need to go through every user that has that role and perform the same operation.

Am I on the right track here with the key-value denormalization of this hierarchical, almost-graph-database design?

The model I've seen for DynamoDB relations is adjacency-list style, where your table contains many "virtual tables" and the partition key acts as your table identifier, with a prefix that you can easily filter on in queries (I'm using the pound/hash "#" as a split character but anything will do):

pre:
PARTITION KEY        => SORT KEY
user#email@email.com => permission A
user#email@email.com => permission B
role#admin           => email@email.com
role#admin           => email2@email.com
role#reader          => email@email.com
Then when you delete a role you execute a query for the "role#"-prefixed partition and just nuke each item in it. When you check "is in role?" you do the same and say "get me everything with partition key role#reader". That way you're being efficient and maximising DynamoDB's strength, which is its key-oriented design, and you can query the partition without a secondary index.
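In code it comes out to roughly this; table and attribute names are just examples, using the aws-sdk v2 DocumentClient:

code:
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// "who is in role admin?" -- a single Query on the partition key, no index needed
async function membersOfRole(role) {
  const { Items } = await docClient.query({
    TableName: 'rbac',
    KeyConditionExpression: 'pk = :pk',
    ExpressionAttributeValues: { ':pk': `role#${role}` },
  }).promise();
  return Items;
}

// deleting a role = query the same partition, then delete each item in it
async function deleteRole(role) {
  const items = await membersOfRole(role);
  await Promise.all(items.map((item) =>
    docClient.delete({
      TableName: 'rbac',
      Key: { pk: item.pk, sk: item.sk },
    }).promise()
  ));
}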

Now with pictures! I've faked what I think your data structure is but as virtual tables;


And if query on role only:


If you need to get both the user and the permission string then you concatenate both values into the secondary index (email@email.com#permissionABCD) and split them apart, or just store them as separate attributes.

For true monstrosities check out 16 tables boiled into a single DynamoDB table using a similar method: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-modeling-nosql-B.html

Cancelbot fucked around with this message at 19:50 on May 9, 2019

Kudaros
Jun 23, 2006
I'm a data scientist coming from an academic background. Good on the machine learning, stats, etc., but not so great with databases. I can query relational databases well enough but I'm thinking now about how to organize enterprise data. This isn't my role, but apparently it is nobody's role at my company, at this point.

There are schema strewn about in pl/sql, postgres, MSSQL server, for every line of business, every variation and merger of the company over the past 30 years, and often times for various clients. It is an unbearable (and undocumented!) mess. I'm not even really sure how to reverse engineer all of it.

Allegedly someone is working on a datalake, but I've no idea what that's realistically going to look like. Is there a general workflow for mashing this all together and curating portions of it for purposes of streamlined analytics?

abelwingnut
Dec 23, 2002


sounds awful. maybe spark or presto can act as the go-between query engine that connects the various systems?

PIZZA.BAT
Nov 12, 2016


:cheers:


Kudaros posted:

I'm a data scientist coming from an academic background. Good on the machine learning, stats, etc., but not so great with databases. I can query relational databases well enough but I'm thinking now about how to organize enterprise data. This isn't my role, but apparently it is nobody's role at my company, at this point.

There are schema strewn about in pl/sql, postgres, MSSQL server, for every line of business, every variation and merger of the company over the past 30 years, and often times for various clients. It is an unbearable (and undocumented!) mess. I'm not even really sure how to reverse engineer all of it.

Allegedly someone is working on a datalake, but I've no idea what that's realistically going to look like. Is there a general workflow for mashing this all together and curating portions of it for purposes of streamlined analytics?

This is what I specialize in & do for a living. First thing you need to do is determine whether you want a data federation, a data lake, or a data hub. This post goes into what those are and their pros and cons:

https://www.marklogic.com/blog/data-lakes-data-hubs-federation-one-best/

Answering the question of which of those three you need first will help guide which tech and processes you'll need to adopt.

skimothy milkerson
Nov 19, 2006

Kudaros posted:

I'm a data scientist coming from an academic background. Good on the machine learning, stats, etc., but not so great with databases. I can query relational databases well enough but I'm thinking now about how to organize enterprise data. This isn't my role, but apparently it is nobody's role at my company, at this point.

There are schema strewn about in pl/sql, postgres, MSSQL server, for every line of business, every variation and merger of the company over the past 30 years, and often times for various clients. It is an unbearable (and undocumented!) mess. I'm not even really sure how to reverse engineer all of it.

Allegedly someone is working on a datalake, but I've no idea what that's realistically going to look like. Is there a general workflow for mashing this all together and curating portions of it for purposes of streamlined analytics?

hello fellow coworker

Kudaros
Jun 23, 2006

Rex-Goliath posted:

This is what I specialize in & do for a living. First thing you need to do is determine if you want a data federation, a data lake, or a data hub. He goes into what those are and their pros and cons here:

https://www.marklogic.com/blog/data-lakes-data-hubs-federation-one-best/

Answering the question of which of those three you need first will help guide which tech and processes you'll need to adopt.

This is a great article for someone like me, thanks!

Easiest solution might be a new job though.

Arcsech
Aug 5, 2008
Elastic released 6.8 and 7.1 of Elasticsearch (and all the rest) today, which as a surprise include authentication and role-based access control in the free version: https://www.elastic.co/blog/security-for-elasticsearch-is-now-free

This is going to be a pretty big deal to a lot of people, because previously you either had to use a third-party plugin or proxy Elasticsearch to get auth on the free version.

Progressive JPEG
Feb 19, 2003

Razzled posted:

Anyone have any good reading or class suggestions for data modeling knowledge with Cassandra? It's an area where I have so little experience that almost all of my suggestions or work in that area amounts to trial and error. I understand the basics but when it comes to best practices etc I'm just totally in the dark

Hello from a year later. Datastax has a bunch of tutorial material; you could start by looking at that. For actually designing a table, the main bits are figuring out keying (order is important!) and picking a compaction strategy that aligns with the expected workload.
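As a sketch of those two knobs (table and column names made up):

code:
// clustering order controls the on-disk sort within each partition;
// TimeWindowCompactionStrategy suits append-mostly time-series data,
// while the size-tiered/leveled strategies fit other write/read mixes
const ddl = `
  CREATE TABLE events_by_day (
    day date,
    ts timestamp,
    payload text,
    PRIMARY KEY ((day), ts)
  ) WITH CLUSTERING ORDER BY (ts DESC)
    AND compaction = {'class': 'TimeWindowCompactionStrategy'}`;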

Pollyanna
Mar 5, 2005

Milk's on them.


We have a collection of roughly 43k documents (one for each US zip code), where one of the fields in each document is an array of IDs that can be arbitrarily large (e.g. there can be up to 500 unique integers in the field). We need to index on that ID field so we can iterate over just the documents containing ID x, remove it from all of them, and then put the ID back in the ones where it actually belongs.

Unfortunately, we want to migrate to DocumentDB, and because a 500-element array is apparently bigger than the 2kB index key size limit, we're unable to migrate our ID array index, so our latency would shoot up. It doesn't look like AWS offers any compression on index keys either, so we're kinda boned.

What's a better way to iterate through a collection and remove value X from a large array of possible Xs? Is this something that can be solved by a smarter updating strategy? Or is it something that Mongo/DocDB just isn't a good fit for? What are my options?

ThePeavstenator
Dec 18, 2012

:burger::burger::burger::burger::burger:

Establish the Buns

:burger::burger::burger::burger::burger:

Pollyanna posted:

We have a collection of roughly 43k documents (one for each US zip code), where one of the fields in each document is an array of IDs that can be arbitrarily large (e.g. there can be up to 500 unique integers in the field). We need to index on that ID field so we can iterate over just the documents containing ID x, remove it from all of them, and then put the ID back in the ones where it actually belongs.

Unfortunately, we want to migrate to DocumentDB, and because a 500-element array is apparently bigger than the 2kB index key size limit, we're unable to migrate our ID array index, so our latency would shoot up. It doesn't look like AWS offers any compression on index keys either, so we're kinda boned.

What's a better way to iterate through a collection and remove value X from a large array of possible Xs? Is this something that can be solved by a smarter updating strategy? Or is it something that Mongo/DocDB just isn't a good fit for? What are my options?

Sounds like you're looking for array contains.

code:
SELECT * FROM c
WHERE ARRAY_CONTAINS(c.idListField, 'idtosearch')
Unless you really need to squeeze out a little extra write performance, you shouldn't need to define any custom indexes on CosmosDB.

I extensively use CosmosDB and it should have no problem with documents of the size and quantity you described. By "index key" you're not referring to the partition key are you?

ThePeavstenator fucked around with this message at 19:44 on Jun 28, 2019

Pollyanna
Mar 5, 2005

Milk's on them.


AWS DocumentDB, sorry.

ThePeavstenator
Dec 18, 2012

:burger::burger::burger::burger::burger:

Establish the Buns

:burger::burger::burger::burger::burger:
rip that's what I get for not reading closely

Pollyanna
Mar 5, 2005

Milk's on them.


The more I think about our MongoDB collections and what we do with them, the more I wish we used a relational database instead.

Arcsech
Aug 5, 2008

I have to ask... Why do you want to migrate to AWS DocumentDB? Everything I have heard about it is that it is a bad reimplementation of roughly half of the MongoDB protocol backed by, effectively, an Aurora instance. I'm genuinely curious why anyone would ever use it - if you want to use the MongoDB protocol, why not just use hosted MongoDB?

Pollyanna posted:

The more I think about our MongoDB collections and what we do with them, the more I wish we used a relational database instead.


Yeah, a lot of MongoDB projects end up feeling that way. My last one sure did.

Arcsech fucked around with this message at 21:59 on Jun 28, 2019

Pollyanna
Mar 5, 2005

Milk's on them.


Arcsech posted:

I have to ask... Why do you want to migrate to AWS DocumentDB? Everything I have heard about it is that it is a bad reimplementation of roughly half of the MongoDB protocol backed by, effectively, an Aurora instance. I'm genuinely curious why anyone would ever use it - if you want to use the MongoDB protocol, why not just use hosted MongoDB?

Request from on high, but aside from that, it’s because we want a managed service instead of having to host our own boxes, set our own sharding and availability, etc. A more turnkey solution, basically. And ideally, we want to do it within AWS. If there are alternatives, I’d love to hear about them, but...

Arcsech posted:

Yeah, a lot of MongoDB projects end up feeling that way. My last one sure did.

...if it was up to me, we wouldn’t even bother with DocumentDB. We’d just consolidate into our Postgres DB.

Pollyanna fucked around with this message at 22:21 on Jun 28, 2019

Arcsech
Aug 5, 2008

Pollyanna posted:

Request from on high, but aside from that, it’s because we want a managed service instead of having to host our own boxes, set our own sharding and availability, etc. A more turnkey solution, basically. And ideally, we want to do it within AWS. If there are alternatives, I’d love to hear about them, but...

Yeah, that's probably going to be a disaster. While this does originate from MongoDB (the company) and could be biased, apparently many features don't work the same between MongoDB and DocumentDB so you'll likely be fighting incompatibilities for a long time.

I haven't used it myself, but if it were me, I'd be looking into MongoDB Atlas for hosted MongoDB, which you can run inside AWS. I'm not sure if you can pay for it through AWS, though; some services like that you can and some you can't.

ThePeavstenator
Dec 18, 2012

:burger::burger::burger::burger::burger:

Establish the Buns

:burger::burger::burger::burger::burger:

Pollyanna posted:

Request from on high, but aside from that, it’s because we want a managed service instead of having to host our own boxes, set our own sharding and availability, etc. A more turnkey solution, basically. And ideally, we want to do it within AWS. If there are alternatives, I’d love to hear about them, but...


...if it was up to me, we wouldn’t even bother with DocumentDB. We’d just consolidate into our Postgres DB.

If you're looking for something turnkey on AWS, why not DynamoDB? I'm not very familiar with AWS DocumentDB but on first glance at the docs, it looks like it's somewhere between fully managing your own Mongo databases and using something fully managed like DynamoDB.


Pollyanna
Mar 5, 2005

Milk's on them.


Turns out we didn't need that index at all. The _new_ problem is that somewhere in our massive collection of crap is a string with a null character in it - and I have no idea where. All my attempts at dumping the collection and grepping for \0 have failed. What's the easiest way to search through a collection and find any fields that have a string with a null character in it?
