GitHub - aaliang/ArtistPairs: dataset of artists for association pairings

@author aliang

The attached utf-8 encoded text file contains the favorite musical artists of 1000 users from LastFM. Each line is a list of up to 50 artists, formatted as follows:

Radiohead,Pulp,Morrissey,Delays,Stereophonics,Blur,Suede,Sleeper,The La's,Super Furry Animals\n Band of Horses,Iggy Pop,The Velvet Underground,Radiohead,The Decemberists,Morrissey,Television\n

etc.

Objective is to produce a list of pairs of artists which appear TOGETHER in at least fifty different lists. For example, in the above sample, Radiohead and Morrissey appear together twice, but every other pair appears only once. Your program should output the pair list to stdout in the same form as the input (eg Artist Name 1, Artist Name 2\n).

==========

Java JDK 7 is assumed

To build: javac ArtistPairs.java

To run: java ArtistPairs [input file] [threshold] Output is dumped to STDOUT, To run and dump output to a file: java ArtistPairs [input file] [threshold] > output

ex. java ArtistPairs Artist_lists_small 50

This is a short description of what is going on from the top level:

As we read in the input file we count the number of occurrences of each artist throughout the data set. Since we are looking for pairs of artists that occur more than 50 times, we maintain a list of artists that we have seen more than 50 times, and the users that listed them.

Time and Space complexity are both O(n) for this step, where n is the number of users taken as input

After all users are processed we take the list of artists that have occurred more than 50 times and map users to lists of those artists that they have listed (essentially a pruned version of the input file)

Time and Space complexity are both O(n)

We now go through the artist-users list that we generated in step 1 and for each artist, we take the list of users that have listed that artist in their list, use the map that we generated in step 2 to increment a hashmap that counts each other artist that they have listed in their lists. i.e. the intersection of these sets. If any of these counters exceeds 50, we add the artist to another arraylist that holds these values, and at the end of the looping for each artist we print out the pair.

Time complexity:

O(mnt), where t is the maximum number of artists that a user can list -- which is a constant (50), and m is the number of artists that appear more than the threshold (also 50) times.

this is really just O(m*n). however m is bounded by n, so this is really

O(n^2)

Space complexity:

At each iteration in the artist loop:

O(n) to build up each hashmap, however this will reset every iteration

This gives us an overall time complexity of O(n^2) which scales reasonably well since the actual value of m will be no greater than
n/t. on extremely large data sets however it might be prudent to just do random sampling instead of iterating over

the entire set.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.settings		.settings
bin		bin
src		src
.classpath		.classpath
.project		.project
50.txt		50.txt
Artist_lists_small.txt		Artist_lists_small.txt
README.md		README.md
pairs.txt		pairs.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

aaliang/ArtistPairs

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages