Homework 0

Handed out: September 29, 2011
Due: October 4-ish, 2011

Description

In this assignment, you will be hand-normalizing a set of Twitter messages from their "raw" form into a more "formal" or "readable" form. The purpose is threefold:
  1. To familiarize yourself with the Twitter data that we will be working with;
  2. To familiarize yourself with some of the types of normalizations that these data might require;
  3. To help collect human annotations to serve as gold-standard data.
The Twitter data itself can be downloaded here: http://skynet.ohsu.edu/~bedrick/bmi506/

The files themselves are UTF-8 formatted, and they should stay that way. :-) If you are unsure about what this means, consult Joel Spolsky's essay "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".

The files are formatted thusly:

id \t text
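
If you would rather load the messages programmatically than open them in an editor, the following is a minimal Python sketch of reading this format. The function name read_messages is made up for illustration, and the sketch assumes one message per line with exactly one tab between the id and the text. Opening the file with an explicit encoding keeps the data as UTF-8.

```python
def read_messages(path):
    """Return a list of (id, text) pairs from a tab-separated UTF-8 file."""
    messages = []
    # An explicit encoding avoids platform-dependent defaults.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            # Split on the first tab only, in case the text itself contains tabs.
            msg_id, text = line.split("\t", 1)
            messages.append((msg_id, text))
    return messages
```

A line with no tab at all will raise a ValueError here, which is a reasonable signal that the file is not in the expected format.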

I'd like the file you turn in to be in the following format:

id \t orig_text
id \t expanded_text
\n
For example:
57615216283881472       @AshhHanner wut are ur interests?
57615216283881472       @AshhHanner What are your interests?
\n

That trailing newline just means that each original/expanded pair should be separated from its neighbors by a blank line.
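
As a rough illustration of the layout described above, here is a small Python sketch that renders original/expanded pairs with a blank line after each pair. The names render_pairs and write_pairs are hypothetical; the only things taken from this handout are the tab separator, the two-line pairing, the blank-line separator, and the UTF-8 requirement.

```python
def render_pairs(pairs):
    """Render (id, orig_text, expanded_text) triples in the turn-in layout."""
    blocks = []
    for msg_id, orig_text, expanded_text in pairs:
        # Two tab-separated lines per message, followed by one blank line.
        blocks.append(f"{msg_id}\t{orig_text}\n{msg_id}\t{expanded_text}\n\n")
    return "".join(blocks)


def write_pairs(path, pairs):
    """Write the rendered pairs to a UTF-8 file with Unix line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(render_pairs(pairs))
```

For instance, calling write_pairs("emily_set_4.txt", pairs) would produce a file in the layout shown in the example above.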

User-mentions (e.g. "@AshhHanner" in the previous example) should be left intact, as should URLs. Emoticons should be replaced with bracketed descriptions, e.g.:

@drakkardnoir I SIMPLY BELIEVE THAT I STARTED WRITING AND RAPPING BECAUSE OF YOU. :) i love you <3
@drakkardnoir I simply believe that I started writing and rapping because of you. [smile] I love you [heart]

If you're unsure as to the meaning of an emoticon, just put in "[emoticon]".
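
The emoticon rule could be sketched in Python along these lines. This is only an illustration: the lookup table KNOWN_EMOTICONS and the pattern EMOTICON_SHAPE are assumptions made for this example, not an official list, and real tweets will contain many emoticons that neither one covers. Tokens that look like an emoticon but are not in the table fall back to "[emoticon]".

```python
import re

# Hypothetical lookup table; extend it as you meet new emoticons.
KNOWN_EMOTICONS = {":)": "[smile]", ":-)": "[smile]", "<3": "[heart]"}

# Rough guess at "eyes, optional nose, mouth" emoticons such as :/ or ;-P
EMOTICON_SHAPE = re.compile(r"[:;=][-o*']?[()\[\]dDpP/\\|]")


def replace_emoticons(text):
    """Replace whitespace-delimited emoticons with bracketed descriptions."""
    out = []
    for token in text.split(" "):
        if token in KNOWN_EMOTICONS:
            out.append(KNOWN_EMOTICONS[token])
        elif EMOTICON_SHAPE.fullmatch(token):
            out.append("[emoticon]")  # looks like an emoticon, meaning unknown
        else:
            out.append(token)  # ordinary words, mentions, and URLs pass through
    return " ".join(out)
```

Because the pattern must match a whole token, a URL such as http://bit.ly/fmc1fT is left alone even though it contains ":/".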

Numbers should be expanded to number names, e.g.:

Closed a SELL GBP/USD position at 1.625 on ZuluTrade.com. net PnL: -$.76 Visit http://bit.ly/fmc1fT  to see my performance.
Closed a sell pounds sterling/US dollar position at one point six two five on ZuluTrade dot com. net profit and loss: minus point seven six dollars Visit http://bit.ly/fmc1fT  to see my performance.

Repeated letters ("flowwww") should be normalized to their non-repeated form ("flow"). If you're unsure about a particular case, though, just leave it alone.
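
To find candidates for this rule, a regular expression can help. The sketch below is a deliberately naive assumption, not a recommended normalizer: it collapses any run of three or more identical letters down to one, which handles "flowwww" but gets words like "cooool" wrong (it yields "col"). Treat its output as suggestions to review by hand. The function names are made up for this example.

```python
import re

# A letter followed by two or more copies of itself (three or more in a row).
REPEATS = re.compile(r"([A-Za-z])\1{2,}")


def collapse_repeats(word):
    """Collapse runs of three or more identical letters to a single letter."""
    return REPEATS.sub(r"\1", word)


def collapse_in_text(text):
    """Apply collapse_repeats to each token, leaving mentions and URLs intact."""
    out = []
    for token in text.split(" "):
        if token.startswith(("@", "http://", "https://")):
            out.append(token)  # user-mentions and URLs stay as they are
        else:
            out.append(collapse_repeats(token))
    return " ".join(out)
```

Runs of exactly two letters ("too", "tweet") are left untouched, since doubled letters are common in ordinary spelling.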

Comments or other annotations should be on their own line, preceded by a "#":

57615216283881472       @AshhHanner wut are ur interests?
57615216283881472       @AshhHanner What are your interests?
# this person needs to learn how to spell!
\n

For example, include a comment line of this sort when you encounter something that you think is an abbreviation but for which you are unable to find a definition.
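
If you want to sanity-check your file before turning it in, something like the following Python sketch may be useful. It is an assumption about how one might check the layout, not a grader: parse_submission is a hypothetical name, and the checks cover only what this handout specifies (two tab-separated lines with the same id per block, optional "#" comment lines, blank lines between blocks).

```python
def parse_submission(text):
    """Parse turn-in text into records; raise ValueError on a malformed block."""
    records = []
    for block in text.split("\n\n"):
        lines = [ln for ln in block.split("\n") if ln.strip()]
        if not lines:
            continue  # extra blank lines between blocks are harmless
        # Comment lines start with "#"; everything else must be "id<TAB>text".
        comments = [ln[1:].strip() for ln in lines if ln.startswith("#")]
        rows = [ln.split("\t", 1) for ln in lines if not ln.startswith("#")]
        if len(rows) != 2 or any(len(row) != 2 for row in rows):
            raise ValueError(f"expected an original/expanded pair, got: {block!r}")
        (orig_id, orig_text), (exp_id, exp_text) = rows
        if orig_id != exp_id:
            raise ValueError(f"ids differ within a pair: {orig_id} vs {exp_id}")
        records.append(
            {"id": orig_id, "orig": orig_text, "expanded": exp_text, "comments": comments}
        )
    return records
```

Read the file with open(path, encoding="utf-8").read() and pass the result in; a ValueError points at the first block that does not match the layout.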

If you have any questions, please don't hesitate to send them my way. A big part of the purpose of this assignment is to characterize the sorts of issues that come up when doing this sort of manual annotation with Twitter data.

Also, as we talked about in class, if there are any specific messages that you don't want to deal with (due to subject matter, etc.), feel free to skip them and go on to the next one.

As for set assignments, here is what Emily wrote down during this morning's class. Remember, each person should expand all of the messages in their set independently; the idea is to end up with duplicate normalizations for each message set. So, for example, Eric and Meysam should each go through the messages in set #1, and each should turn in their own file containing normalized versions of the set's messages.

Set 1: Eric and Meysam
Set 2: Tomer and Travis
Set 3: Andrew and Geza
Set 4: Maider and Emily
Set 5: Mala and Reese

Finally, to turn in the assignment, just send me an email with the file itself (named your_name_set_number.txt, e.g. emily_set_4.txt) as well as your observations about the types of non-standard words that you encountered while working with this data set. If you encounter types of NSWs that were not listed in the typology in Richard's paper, please make sure to mention that.

Thanks, and, as I said, don't hesitate to ask if you run into any questions or problems.

Please email your final file and comments to Steven Bedrick at bedricks@ohsu.edu.