For our machine learning project, we attempted to automatically guess ratings or labels for Slashdot comments based on their content. As a side effect, we generated some data on which words and phrases appear disproportionately often in comments of each kind: high-ranked, low-ranked, interesting, uninteresting, funny, unfunny, and so on.
The set of the top 40 "Funny" phrases turns out to be a hodgepodge of cultural references. I am not sure I understand all of them.
1. xkcd.com $; 2. xkcd.com; 3. nukem forever; 4. carrier $; 5. slashdot editor; 6. skynet; 7. clod $; 8. woman $; 9. grue; 10. newt; 11. no carrier; 12. asparagus; 13. nigerian prince; 14. porn with; 15. grue $; 16. an outrage; 17. kentucky $; 18. eight camera; 19. reality distortion; 20. god what; 21. six video; 22. electronic games; 23. locally $; 24. paperbacks; 25. distortion field; 26. its belly; 27. my underwear; 28. am intrigued; 29. penny-arcade.com $; 30. priceless $; 31. lycra; 32. emacs $; 33. polar bear; 34. cried out; 35. burma shave; 36. an african; 37. porn for; 38. your grip; 39. expects the; 40. not talk
("$" means end of comment; "^" means beginning of comment.)
The list of top "Interesting" phrases suggests that workplace stories are interesting:
employees were; what worked; department i; our clients; wap; reviews on; file servers; work etc; could connect; stance that; updates the; those available; hitting my; europe to; i'm seeing; happening with; snuff; time anyone; spam has; to snuff; the bases; thin and; my college; street to; extreme programming; be neutral; late 19th; management they; from game; tenacity; withstanding; own account; right beside; magpies; from intel's; my food; obscure stuff; language when; and trash; been dragging
Meanwhile, the phrases least likely to be found in "Interesting" comments are either insulting or profane:
^ no; insensitive; again $; you insensitive; ^ oh; clod; insensitive clod; ^ you; ^ just; ^ then; ^ well; the hell; ^ or; slashdot; ^ and; you $; post $; ^ yes; ^ why; ^ but; ^ yeah; um; you mean; ^ they; wikipedia.org $; then $; religious; ^ now; clod $; mod; is called; ^ not; right $; ^ he; ^ ah; first post; ^ is; ^ your; ^ it's; fuck
These lists were generated using a corpus of 55,561 comments posted between June and November 2008.
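For readers curious how a ranking like this might be produced, here is a minimal sketch. It is not the code we used: the scoring rule (a smoothed log-odds ratio over per-comment phrase counts), the whitespace tokenizer, and the names `ngrams` and `rank_phrases` are all assumptions made for illustration. It does follow the same convention of marking the beginning of a comment with "^" and the end with "$".

```python
import math
from collections import Counter


def ngrams(text, max_n=2):
    """Yield unigrams and bigrams; '^' and '$' mark comment start and end."""
    tokens = ["^"] + text.lower().split() + ["$"]
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip grams that consist only of boundary markers.
            if all(t in ("^", "$") for t in gram):
                continue
            yield " ".join(gram)


def rank_phrases(comments, label, min_count=2, alpha=1.0):
    """Rank phrases from most to least over-represented in `label` comments.

    comments: iterable of (text, label) pairs.
    The score is an assumed smoothed log-odds ratio, not the original method.
    """
    in_docs, out_docs = Counter(), Counter()
    n_in = n_out = 0
    for text, lab in comments:
        grams = set(ngrams(text))  # count each phrase once per comment
        if lab == label:
            in_docs.update(grams)
            n_in += 1
        else:
            out_docs.update(grams)
            n_out += 1

    scores = {}
    for g in set(in_docs) | set(out_docs):
        if in_docs[g] + out_docs[g] < min_count:
            continue  # too rare to say anything about
        # Additive smoothing keeps both probabilities strictly inside (0, 1).
        p_in = (in_docs[g] + alpha) / (n_in + 2 * alpha)
        p_out = (out_docs[g] + alpha) / (n_out + 2 * alpha)
        scores[g] = math.log(p_in / (1 - p_in)) - math.log(p_out / (1 - p_out))
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rank_phrases(comments, "Funny")[:40]` on a list of `(text, label)` pairs would give a top-40 list in the spirit of the one above, and the tail of the same ranking gives the phrases least likely to appear under that label. The `min_count` cutoff matters in practice: without it, phrases seen only once or twice tend to dominate both ends of the ranking.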
What a wonderful study!
"Its belly"?