  • Locked thread
coffeetable
Feb 5, 2006

TELL ME AGAIN HOW GREAT BRITAIN WOULD BE IF IT WAS RULED BY THE MERCILESS JACKBOOT OF PRINCE CHARLES

YES I DO TALK TO PLANTS ACTUALLY

Pie Colony posted:

if the company is well-known or has room to be picky with candidates, chances are they don't really just want a "sum these up, sort and return the first 10" solution. they are probably looking for candidates that know to use a heap to do the sorting in n log k, and probably someone that does all the counting in parallel.
this is a screening test. they don't care in the least about which sorting algorithm you're using. they literally do just want you to demonstrate you can sum, sort and return the first 10.

worry about that poo poo at the actual interview
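for reference, "sum, sort and return the first 10" can be sketched roughly like this. it's a minimal sketch, not anyone's official answer: the class name TopLines, the method topK and the sample lines are all made up for illustration, and it sorts every distinct line instead of using a heap, which is plenty for a screening test.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: none of these names come from the original test.
class TopLines {

    /** Returns up to k distinct lines, most frequent first; ties are broken alphabetically. */
    static List<Map.Entry<String, Integer>> topK(List<String> lines, int k) {
        // "sum": count how many times each distinct line occurs
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum);
        }
        // "sort": order by count descending, then by the line itself for a repeatable result
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        // "return the first 10": cut the sorted list off at k
        return new ArrayList<>(entries.subList(0, Math.min(k, entries.size())));
    }

    public static void main(String[] args) {
        // made-up sample data in the word<TAB>topic shape the task describes
        List<String> lines = List.of(
                "cat\tpets", "dog\tpets", "cat\tpets", "gdp\tmoney", "cat\tpets", "dog\tpets");
        for (Map.Entry<String, Integer> entry : topK(lines, 2)) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

the same count-then-sort shape covers the topic and word tasks too if you split each line on the tab first and count one half of it.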

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

syntaxrigger posted:

So an ML shop contacted me because I have java in my resume and I have lofty goals of one day not sucking at ML. They sent me a programming test that I did horribly on because I am not strong enough in when to use what data structures and algorithms. So instead of wallowing, coffeetable convinced me to post here for some insight into what I should have done.

.tsv file

Tasks:
The input is a text file containing newline separated strings.
Each string is a word, followed by a tab, followed by a topic.

1. Load the data into memory and print the total number of lines. - DONE

2. Print the 10 most common lines.

3. Print the 10 most common topics, sorted by commonness. A topic consists of the second part of each line, where each line is independent.

4. Print the 10 most common words, sorted alphabetically. A word consists of the first part of each line, where each line is independent.

5. Print the 10 most common words for each of the top 10 most common topics.

6. Write a function to compute the probability of a topic, given a word. In your program, print this value for the 100 pairs of words and topics from #5.

7. Write a function to compute the probability of a word, given a topic. In your program, print this value for the 100 pairs of words and topics from #5.

8. Write into your program a parameter that specifies whether the data should be treated case insensitively (treat "Apple" and "appLE" as the same), or case sensitively.

9. Write into your program a parameter which specifies the minimum number of times a line must be seen to be considered.

10. Print the frequency of each word count. For example, there are 10 words that occur 20 times, there are 12 words that occur 19 times, and so on.

My code for the second task so far:
code:
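import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;

public class TopTenNames {

	public TopTenNames() throws IOException {
		Path path = FileSystems.getDefault().getPath("./", "words_and_topics.tsv");
		// try-with-resources closes the reader even if reading fails
		try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
			String line = reader.readLine();

			// loop over every line until end of file
			while (line != null) {
				// Decide if line is dupe or not and increment occurrence of line

				line = reader.readLine();
			}
		}
	}

	// Hacky way to throw an exception, refactor if I have time.
	public static void main(String[] args) throws IOException {
		TopTenNames ttn = new TopTenNames();
	}

}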

I knew I probably needed two data structures but I kept going in circles about which ones to use. Trimmed the code because it is in a broken state atm.

So, in general, what is a good way to go about the 2nd task?

I couldn't decide if I should store the entire file in a variable then try to sort the variable into a collection of lines and occurrences_of_lines then pop the top ten or do something else entirely. I kept panicking because I didn't know the 'best way' to do this. I should have focused on 'a way' to do this and worry about optimization later. I always do that stupid poo poo.

Open to constructive suggestions.

FYI the test is done I turned in my lovely one task solution and they won't be calling back so, no worries about helping me get a job I don't deserve.

load it into sqlite or something, virtually all of those can be done in a single query each. sqlite3 shell has a tsv loading command i think

Forums Terrorist
Dec 8, 2011

coffeetable posted:

why're you intimidated by that ML syllabus

yeah that's because you need more practice

you sound like my mum when she pushed me to do ib standard maths instead of studies because "you don't want to close doors!!!!" even tho i didn't meet the prereqs

it was an entirely predictable catastrofuck and it ended up forcing me to do some godawful hybrid ap/ib thing for my last year of hs

BONGHITZ
Jan 1, 1970

math is not easy for anyone

fart simpson
Jul 2, 2005

DEATH TO AMERICA
:xickos:

what's ib

Forums Terrorist
Dec 8, 2011

international baccalaureate, it's like the ap but for europe

Bloody
Mar 3, 2013

lol if you do serious ml work in languages that are slow, aka not in c++ or fortran

coffeetable
Feb 5, 2006

TELL ME AGAIN HOW GREAT BRITAIN WOULD BE IF IT WAS RULED BY THE MERCILESS JACKBOOT OF PRINCE CHARLES

YES I DO TALK TO PLANTS ACTUALLY
lol if you don't just call out to c++ or fortran for anything that actually needs to be fast

Bloody
Mar 3, 2013

coffeetable posted:

lol if you don't just call out to c++ or fortran for anything that actually needs to be fast

Bloody
Mar 3, 2013

assume a fitness function f that maps a languages ability to:
fart around in a file system
janitor strings
call out w/ c calling convention

solve for the global maximum

Bloody
Mar 3, 2013

my current solution is c# but i think its just a local maximum

coffeetable
Feb 5, 2006

TELL ME AGAIN HOW GREAT BRITAIN WOULD BE IF IT WAS RULED BY THE MERCILESS JACKBOOT OF PRINCE CHARLES

YES I DO TALK TO PLANTS ACTUALLY
im p impressed w python so far. the language itself is garbage but scikit-learn and pymc are both really good

found a cluster in some medical trial data today that another company had completely missed. no idea if it means anything yet - we're blinded - but gently caress yeah variational methods

e: it was the size of a house as well. from what i can tell they just banged it through PCA and some hierarchical clustering heuristic and went "welp nuttin there". half of it looked like it was done in excel too

coffeetable fucked around with this message at 19:20 on Sep 16, 2014

Bloody
Mar 3, 2013

"data science" is a dogwhistle for pca

FamDav
Mar 29, 2008

coffeetable posted:

im p impressed w python so far. the language itself is garbage but scikit-learn and pymc are both really good

found a cluster in some medical trial data today that another company had completely missed. no idea if it means anything yet - we're blinded - but gently caress yeah variational methods

e: it was the size of a house as well. from what i can tell they just banged it through PCA and some hierarchical clustering heuristic and went "welp nuttin there". half of it looked like it was done in excel too

how did you find it

coffeetable
Feb 5, 2006

TELL ME AGAIN HOW GREAT BRITAIN WOULD BE IF IT WAS RULED BY THE MERCILESS JACKBOOT OF PRINCE CHARLES

YES I DO TALK TO PLANTS ACTUALLY

FamDav posted:

how did you find it
mixture of gaussians w/ conjugate prior. as described in MLaPP chap 21, which i was lucky enough to read last week, and as implemented here

http://scikit-learn.org/stable/modules/dp-derivation.html

Bloody
Mar 3, 2013

ok seriously what does it take for my eyes to not immediately gloss over at the sight of unfamiliar greek letters

where the gently caress did i go so wrong

Bloody
Mar 3, 2013

thatd be 10000x more readable to me if instead of

they actually wrote the poo poo out

lmao transparency

fritz
Jul 26, 2003

Bloody posted:

thatd be 10000x more readable to me if instead of

they actually wrote the poo poo out

lmao transparency

theres only 24 greek letters, plus a few capitals, phi mu and sigma are pretty common

and hey they did write out Gamma instead of using an actual Gamma

fritz
Jul 26, 2003

fritz posted:

theres only 24 greek letters, plus a few capitals, phi mu and sigma are pretty common

and hey they did write out Gamma instead of using an actual Gamma

looking further they use nu, which is a real nu-sance

Bloody
Mar 3, 2013

24 symbols my brain processes in idiotic ways
a, b, y (uppercase is just a travesty tho), triangle, e, squiggle z, n, theta (lol if you dont theta), i, k, eigenthing, micro, v, jesus christ what is this letter, o, pi, stupid p, capital sigma obvi but those other two just lol (jiffy lube) (why the gently caress are there two), t, u, hosed up pitchfork, x, pitchfork, ohms

going off the first table in https://en.wikipedia.org/wiki/Greek_alphabet

Bloody
Mar 3, 2013

idk id just always rather read pseudocode than maths because my background is stronger in codethings than mathsthings

distortion park
Apr 25, 2011


Bloody posted:

24 symbols my brain processes in idiotic ways
a, b, y (uppercase is just a travesty tho), triangle, e, squiggle z, n, theta (lol if you dont theta), i, k, eigenthing, micro, v, jesus christ what is this letter, o, pi, stupid p, capital sigma obvi but those other two just lol (jiffy lube) (why the gently caress are there two), t, u, hosed up pitchfork, x, pitchfork, ohms

going off the first table in https://en.wikipedia.org/wiki/Greek_alphabet

zeta is a girl with a ponytail on a swing, xi is the same with arms sticking out.

there are two small sigmas because you use a different one at the end of a word!

distortion park
Apr 25, 2011


i don't know how people program stuff without intellisense and linq, i can barely program with them

coffeetable
Feb 5, 2006

TELL ME AGAIN HOW GREAT BRITAIN WOULD BE IF IT WAS RULED BY THE MERCILESS JACKBOOT OF PRINCE CHARLES

YES I DO TALK TO PLANTS ACTUALLY

Bloody posted:

thatd be 10000x more readable to me if instead of

they actually wrote the poo poo out

lmao transparency
i don't think itd help if latin letters were used. it'd actually be harder to interpret bc you wouldn't have the context of "mu's always a mean". if you don't have that context anyway, latin letters prob wouldn't help bc the distributions would still be meaningless to you.

for the record though, here it is in english:

each sample (X_i) is drawn from a normal distribution. which set of parameters are used for that normal distribution is decided by the indicator variable for that sample, z_i. that indicator is in turn drawn from a Dirichlet process* whose hyperparameters are drawn from a beta distribution.

alternatively: the Dirichlet process models the assignment of samples to clusters, the normal-gamma distributions model the parameters for each cluster, and the normal distribution models the samples within each cluster

*aka stick-breaking process, which is why it's denoted SBP
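
if code reads easier than the notation, the stick-breaking part can be sketched like this. big assumptions, all mine and not from the scikit-learn page: the process is truncated to k sticks with the leftover mass lumped into one extra slot, alpha is fixed instead of being given a prior, and the cluster means are hardcoded with unit variance rather than drawn from normal-gamma distributions. the names StickBreakingSketch, stickBreakingWeights and sampleIndicator are made up.

```java
import java.util.Random;

// Hypothetical illustration of a truncated stick-breaking process.
class StickBreakingSketch {

    /** Breaks a unit-length stick into k pieces; the leftover length is stored in slot k. */
    static double[] stickBreakingWeights(int k, double alpha, Random rng) {
        double[] weights = new double[k + 1];
        double remaining = 1.0;
        for (int i = 0; i < k; i++) {
            // Beta(1, alpha) draw via the inverse CDF, since F(x) = 1 - (1 - x)^alpha
            double fraction = 1.0 - Math.pow(1.0 - rng.nextDouble(), 1.0 / alpha);
            weights[i] = remaining * fraction; // break off this fraction of what is left
            remaining -= weights[i];
        }
        weights[k] = remaining; // mass not assigned to any of the first k clusters
        return weights;
    }

    /** Draws an indicator z_i: index i is chosen with probability weights[i]. */
    static int sampleIndicator(double[] weights, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (u < cumulative) {
                return i;
            }
        }
        return weights.length - 1; // guard against floating-point rounding
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int k = 5;
        double[] weights = stickBreakingWeights(k, 1.0, rng);
        // hardcoded cluster means (assumption); the full model draws these too
        double[] means = new double[k + 1];
        for (int i = 0; i <= k; i++) {
            means[i] = 10.0 * i;
        }
        for (int n = 0; n < 5; n++) {
            int z = sampleIndicator(weights, rng);    // which cluster this sample belongs to
            double x = means[z] + rng.nextGaussian(); // X_i ~ Normal(mean of cluster z_i, 1)
            System.out.printf("z=%d x=%.2f%n", z, x);
        }
    }
}
```

this only runs the model forwards to generate fake data. the variational fitting that actually finds clusters in real data is a separate, much hairier step.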

BONGHITZ
Jan 1, 1970

they are just pictures, its just like learning chinese

bobbilljim
May 29, 2013

this christmas feels like the very first christmas to me
:shittydog::shittydog::shittydog:

pointsofdata posted:

i don't know how people program stuff without intellisense and linq, i can barely program with them

me @ work :

*hunt and peck 2 or 3 letters*

*mash ctrl+space*

then i take 5

Share Bear
Apr 27, 2004

Bloody posted:

idk id just always rather read pseudocode than maths because my background is stronger in codethings than mathsthings

yeah, i browsed some math-y books on coworkers desks and there's the assumption that you know what all these symbols mean or imply already without even defining other stuff or what it's supposed to do

fritz
Jul 26, 2003

Share Bear posted:

yeah, i browsed some math-y books on coworkers desks and there's the assumption that you know what all these symbols mean or imply already without even defining other stuff or what it's supposed to do

some books have a table with their particular notations in the introduction

Jenny Agutter
Mar 18, 2009

fritz posted:

some books have a table with their particular notations in the introduction

or on the inside of the back cover

Workaday Wizard
Oct 23, 2009

by Pragmatica
hello,

first, thanks shaggar for the advice earlier. i got a project up and running with all the nice stuff maven and mybatis provide.

I have a doubt, is using a singleton for my mybatis SqlSessionFactory appropriate?

here is the singleton btw:
Java code:
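import java.io.IOException;
import java.io.InputStream;

import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;

public class PlaySqlSessionFactory
{
	private static SqlSessionFactory sqlSessionFactory = null;

	/**
	 * Create the SQL Session Factory singleton
	 */
	private static void create()
	{
		String resource = "mybatis-config.xml";
		// try-with-resources closes the config stream once the factory is built
		try (InputStream inputStream = Resources.getResourceAsStream(resource))
		{
			sqlSessionFactory = new SqlSessionFactoryBuilder().build(inputStream);
		}
		catch (IOException e)
		{
			// FIXME: Log the error
			sqlSessionFactory = null;
		}
	}

	/**
	 * Get the SQL Session Factory singleton.
	 * Synchronized so two threads can't both see null and build the factory twice.
	 * @return the singleton (could be null for failure)
	 */
	public static synchronized SqlSessionFactory get()
	{
		if(sqlSessionFactory == null)
			create();

		return sqlSessionFactory;
	}
}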

Shaggar
Apr 26, 2006
eh, that's one way to do it I guess. I prefer to configure the datasource in my spring config and let mybatis-spring (another lib) generate proxies for my mapping interfaces. then I just inject those into where I use them.

so if I have an interface like

Java code:
public interface SomeDataMapper
{
	public abstract Whatever getThing(SomeOtherClass parameter);
}
in ur mapping xml u'll have a mapping config named SomeDataMapper and then a statement in there named getThing that handles the mapping of the SomeOtherClass parameter to the inputs of the statement and then handles the mapping back to Whatever

Obviously you can implement the SomeDataMapper interface yourself with a singleton to generate sqlsessions that you use inside the implementation of the getThing method. But its gonna be dumb boilerplate, so mybatis-spring comes to the rescue.

w/ mybatis-spring you add a line that loads the spring mybatis mapper thingy (idr the real name) and it hunts down ur mapper xmls and matches them up with your mapper interfaces and builds proxy objects in the bean container with the same name as your interface. so in our example this would be a bean named someDataMapper.

then wherever you need to use that interface you inject the proxy bean into your code from spring.

Idk if that helps cause idk how much spring you know. Just sticking w/ the singleton is probably fine, but learning spring has a load of advantages once you get it.
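
To give a rough feel for what "generates proxies for my mapping interfaces" means, here is a sketch that uses only the JDK's java.lang.reflect.Proxy. It is not mybatis-spring code and makes no claim about how that library does it internally: MapperProxySketch and createProxy are made-up names, the interface is simplified to plain Strings, and a Map of method name to text stands in for the mapping xml.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Map;

// Hypothetical illustration only; no MyBatis or Spring classes are involved.
class MapperProxySketch {

    /** Builds an object implementing mapperType whose calls are resolved by method name. */
    static <T> T createProxy(Class<T> mapperType, Map<String, String> statements) {
        InvocationHandler handler = (proxy, method, args) -> {
            // a real mapper would run the SQL statement registered under this method name
            String statement = statements.get(method.getName());
            if (statement == null) {
                throw new UnsupportedOperationException("no statement named " + method.getName());
            }
            // stand-in for "bind the parameter, run the query, map the result back"
            return statement + " <- " + args[0];
        };
        return mapperType.cast(Proxy.newProxyInstance(
                mapperType.getClassLoader(), new Class<?>[] {mapperType}, handler));
    }

    public static void main(String[] args) {
        SomeDataMapper mapper = createProxy(
                SomeDataMapper.class, Map.of("getThing", "SELECT thing"));
        System.out.println(mapper.getThing("42"));
    }
}

// Simplified version of the interface above: String replaces Whatever and SomeOtherClass.
interface SomeDataMapper {
    String getThing(String parameter);
}
```

The point is just that nobody writes a class implementing SomeDataMapper by hand; the object is built at runtime from the interface plus the statement config.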

DONT THREAD ON ME
Oct 1, 2002

by Nyc_Tattoo
Floss Finder

Forums Terrorist posted:

international baccalaureate, it's like the ap but for europe

and helicopter families in the us

MeruFM
Jul 27, 2010
Serious question, what is a factory and how is it different from a constructor

Soricidus
Oct 21, 2010
freedom-hating statist shill

MeruFM posted:

Serious question, what is a factory and how is it different from a constructor
a factory is a method that is used to acquire an instance of a class or interface.

it's different from a constructor because the object it returns may not always be the same class and may not always be a new instance.
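
a minimal sketch of both differences, with made-up names (Money, Zero, Amount, MoneyFactory) that exist only for this example: the factory method hands back a shared instance in one case and a different concrete class in the other, neither of which `new` can do.

```java
// Hypothetical example types; nothing here comes from a real library.
class FactoryDemo {
    public static void main(String[] args) {
        Money a = MoneyFactory.of(0);
        Money b = MoneyFactory.of(0);
        Money c = MoneyFactory.of(995);
        System.out.println(a == b);                       // same shared object or not
        System.out.println(a.getClass().getSimpleName()); // concrete class behind the interface
        System.out.println(c.getClass().getSimpleName());
    }
}

interface Money {
    long cents();
}

final class Zero implements Money {
    public long cents() {
        return 0;
    }
}

final class Amount implements Money {
    private final long cents;

    Amount(long cents) {
        this.cents = cents;
    }

    public long cents() {
        return cents;
    }
}

final class MoneyFactory {
    private static final Money ZERO = new Zero();

    /** Factory method: the caller only ever sees the Money interface. */
    static Money of(long cents) {
        if (cents == 0) {
            return ZERO; // not a new instance: the same object is reused every time
        }
        return new Amount(cents); // a different concrete class than ZERO
    }
}
```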

ii oh el
Jan 9, 2007

MeruFM posted:

Serious question, what is a factory and how is it different from a constructor

a factory is good for returning different versions of an object depending on state.

like let's assume insurance is a godawful cluster gently caress, i know farfetched. there's hundreds of different companies and thousands of different policies and you're trying to get the billing on a root canal and capping so that the dentist can buy his trophy wife a make-up bag.

billings are done completely differently by each company and are further differentiated by each policy. luckily each person has their policy id and insurance company in the record so you just call GitMoneyFactory.dollaDollaBillzYo() with your person object and it returns the corresponding GitMoney class capable of doing all that poo poo you need to do for FYGM insurance company that the patient belongs to.

if you had used a constructor instead youd need a switch with thousands of cases.

Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe

ii oh el posted:

a factory is good for returning different versions of an object depending on state.

like let's assume insurance is a godawful cluster gently caress, i know farfetched. there's hundreds of different companies and thousands of different policies and you're trying to get the billing on a root canal and capping so that the dentist can buy his trophy wife a make-up bag.

billings are done completely differently by each company and are further differentiated by each policy. luckily each person has their policy id and insurance company in the record so you just call GitMoneyFactory.dollaDollaBillzYo() with your person object and it returns the corresponding GitMoney class capable of doing all that poo poo you need to do for FYGM insurance company that the patient belongs to.

if you had used a constructor instead youd need a switch with thousands of cases.

You missed a big part, the GitMoneyFactory is the one that actually contains the switch statement with thousands of cases. The point is that the terrible clusterfuck of logic is encapsulated so that you don't have to copy and paste it everywhere in your code that wants to just get paid.
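
Roughly what that looks like, as a sketch with two cases instead of thousands. Only GitMoneyFactory, dollaDollaBillzYo and GitMoney are names from the post above; Person's fields, the two billing classes, the "OtherCo" insurer and the bill method are assumptions made up for the example.

```java
// Sketch only: Person, FygmBilling, OtherCoBilling and bill(...) are hypothetical.
class BillingDemo {
    public static void main(String[] args) {
        Person patient = new Person("FYGM", "P-123");
        // the caller never mentions a concrete billing class
        GitMoney billing = GitMoneyFactory.dollaDollaBillzYo(patient);
        System.out.println(billing.bill("root canal"));
    }
}

interface GitMoney {
    String bill(String procedure);
}

class Person {
    final String insuranceCompany;
    final String policyId;

    Person(String insuranceCompany, String policyId) {
        this.insuranceCompany = insuranceCompany;
        this.policyId = policyId;
    }
}

class FygmBilling implements GitMoney {
    private final String policyId;

    FygmBilling(String policyId) {
        this.policyId = policyId;
    }

    public String bill(String procedure) {
        return "FYGM claim for " + procedure + " under policy " + policyId;
    }
}

class OtherCoBilling implements GitMoney {
    private final String policyId;

    OtherCoBilling(String policyId) {
        this.policyId = policyId;
    }

    public String bill(String procedure) {
        return "OtherCo invoice: " + procedure + " (" + policyId + ")";
    }
}

class GitMoneyFactory {
    /** The only place that knows which insurer maps to which billing class. */
    static GitMoney dollaDollaBillzYo(Person person) {
        switch (person.insuranceCompany) {
            case "FYGM":
                return new FygmBilling(person.policyId);
            case "OtherCo":
                return new OtherCoBilling(person.policyId);
            default:
                throw new IllegalArgumentException("unknown insurer: " + person.insuranceCompany);
        }
    }
}
```

Adding a new insurer means touching the switch in one file, not every call site that wants to get paid.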

ii oh el
Jan 9, 2007

Janitor Prime posted:

You missed a big part, the GitMoneyFactory is the one that actually contains the switch statement with thousands of cases. The point is that the terrible clusterfuck of logic is encapsulated so that you don't have to copy and paste it everywhere in your code that wants to just get paid.

you're right i should've explained that. thanks for pointing it out. i sometimes think that i imply connections clearly but on reflection haven't at all.

Vincent K. McMahon
Sep 13, 2014

by Ralp
so ive got a question. I am making something that revolves around people committing code to github. problem is i know most people wont commit and push frequently enough for my site to be..accurate.

so tell me how people SHOULD be using version control, no smart rear end answers please!!!

should it be commit often and push often, don't let too many commits build up locally before pushing them?

Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe
It's not just about the number of commits, but the stability of the feature you're working on and the impact it has on other people who will pull your changes in a half-finished state. This is why branching is important: people don't fuck up the main line of development, but are free to push their changes whenever they want. In your case you can feel better about telling them to push often so that your tracking thing shows up-to-date information.

orphean
Apr 27, 2007

beep boop bitches
my monads are fully functional

Vincent K. McMahon posted:

so ive got a question. I am making something that revolves around people committing code to github. problem is i know most people wont commit and push frequently enough for my site to be..accurate.

so tell me how people SHOULD be using version control, no smart rear end answers please!!!

should it be commit often and push often, don't let too many commits build up locally before pushing them?

you dont push 'commits' per se, you push a completed item (where item is a bug fix or a feature or whatever). They should be working in some local branch on their crap, then when its ready to go, merge/rebase/whatever it into master and push from there once for the whole item.
