Content index generator
Some algorithms and metrics use, as input, an index containing the information about the user-generated contents associated to each user. In order to generate these indexes, we provide two programs, taking advantage of the Lucene library (https://lucene.apache.org/). We detail here how to execute these programs.
User index generator
This is the Lucene index used by the Twittomender algorithm (and the novelty and diversity metrics). Each user has associated a single document, built by concatenating the information pieces published either by the user, or by her neighbors. In order to build this index, we have to execute the following program:
java -jar RELISON.jar index user graph multigraph directed weighted selfloops information-pieces header orientation index-route
where
graph
: a file containing the social network.multigraph
: true if the network allows multiple edges between each pair of users, false otherwise.directed
: true if the network is directed, false otherwise.weighted
: true if we want to use the weights of the links, false otherwise (weights will be binary).selfloops
: true if we allow links between a node and itself, false otherwise.information-pieces
: a file containing the information pieces (See Information pieces file below).header
: true if the file contains a header, false otherwise.orientation
: *own
: uses the pieces created by each user as their representation. *IN
: uses the pieces created by the incoming neighbors of the user as her representation. *OUT
: uses the pieces created by the outgoing neighbors of the user as her representation. *UND
: uses the pieces created by both the outgoing and incoming neighbors of the user as her representation. *MUTUAL
: uses the pieces created by the mutual neighbors of the user as her representation.output
: a directory in which to store the index.
Information pieces index generator
This is the Lucene index used by the Centroid CB algorithm. An user-generated content is related to each document in the index. A relation between information pieces and creators is also stored.
java -jar RELISON.jar index infopiece graph multigraph directed weighted selfloops information-pieces header orientation index-route
where
information-pieces
: a file containing the information pieces (See Information pieces file below).header
: true if the file contains a header, false otherwise.output
: a directory in which to store the index.
Information pieces file
The information pieces (individual user-generated contents) file needs to have the following format (CSV divided by tabs):
infoId userId text reprCount likeCount created truncated
where
infoId
: identifier of the information piece.userId
: identifier of the creator.text
: the content of the information piece.reprCount
: number of times the piece has been repropagated.likeCount
: number of likes the piece has been received.created
: UNIX timestamp indicating the date of creation.truncated
: whether we are taking the complete text, or just a small part.
The text must be in UTF-8 format, and user-generated contents are separated by line skips. Fields (like text) which might have tabs or line skips inside must be properly escaped, and surrounded by “”.