Redtube title analyzer

If you are part of my family, a lecturer or a future employer, do not continue reading!!!
The following content is for amusement purposes only and I am not responsible for anything. Do NOT take it seriously!!!


So after this warning I can assume you are a freak like me and the people I live with. I live in a geek house with 5 men and obviously we sometimes talk about porn. One of our favorite sites here is redtube.com (setting it as start page and so on). At some stage we had a discussion about what the most used term in the titles of this fine video material would be. Sitting down, I thought that my computer could easily find this out for me. So I wrote a little script:
#!/usr/bin/ruby
require "open-uri"
require 'rubygems'
require 'hpricot'

counter = 1
begin
  # Fetch one page of listings (on Ruby 3.0+ this call has to be URI.open)
  page = open("http://www.redtube.com/?page=#{counter}")
  # open-uri returns a StringIO or a Tempfile; both respond to read
  ps = page.read

  doc = Hpricot.parse(ps)

  # The titles are taken from links with the CSS class "s"; print one word per line
  (doc/"a.s").each do |link|
    link.inner_html.downcase.split.each do |word|
      puts word
    end
  end
  counter = counter + 1
end while (ps.index('No Videos found') == nil)

This just scans through all the pages, prints every single word of every title and exits when there are no pages left. I know you can optimize this and that you could write a shell script to do it, but bear with me. It returned a list of 39665 words from 490 pages of titles. But that is not really interesting on its own: we want to count the words, so here are the top 10. The first column is the number of repetitions, the second the word; the list is sorted in ascending order, so the most used word comes last.
574 2
596 gets
615 with
620 the
641 fucked
726 girl
762 and
798 her
877 hot
1059 in
Who would have guessed that 'in' would be the most used word and that '2' is used so often? Everyone I asked assumed it would be some rude word.
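
The counting step is not part of the script above, so here is a minimal sketch of how it could be done in Ruby. It assumes the scraper's output was saved to a file with one word per line; the method name `top_words` and the file argument are made up for this illustration and are not what was actually used for the numbers above.

```ruby
# Count how often each word occurs and return [word, count] pairs,
# most frequent first. Ties are broken alphabetically so the order is stable.
def top_words(words, limit = 10)
  counts = Hash.new(0)
  words.each { |word| counts[word] += 1 }
  counts.sort_by { |word, count| [-count, word] }.first(limit)
end

# Hypothetical usage: pass the file holding the scraper's output as argument.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  words = File.readlines(ARGV[0]).map(&:strip).reject(&:empty?)
  top_words(words).each { |word, count| puts "#{count} #{word}" }
end
```

Note that this sketch prints the most frequent word first, while the list above is in ascending order.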
Here is a little graph of all the words by repetition. 

You can clearly see that there are loads of words that only show up once, and then there are a few words that are repeated quite a lot of times. I suppose you could analyze this much further and find out exactly why these words are repeated so many times.
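
To put numbers on "loads of words that only show up once", one could tally how many distinct words share each repetition count. The following is only a sketch under the assumption that the words are available one per line in a file; `count_distribution` is a hypothetical helper name and not how the graph was actually produced.

```ruby
# Map each repetition count to the number of distinct words that occur
# exactly that often, e.g. { 1 => 2, 3 => 1 } means two words appear once
# and one word appears three times.
def count_distribution(words)
  word_counts = Hash.new(0)
  words.each { |word| word_counts[word] += 1 }

  distribution = Hash.new(0)
  word_counts.each_value { |count| distribution[count] += 1 }
  distribution.sort.to_h
end

# Hypothetical usage: pass the file holding the scraped words as argument.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  words = File.readlines(ARGV[0]).map(&:strip).reject(&:empty?)
  count_distribution(words).each do |count, how_many|
    puts "#{how_many} word(s) appear #{count} time(s)"
  end
end
```

The first line of the output then says directly how many words were used only once.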

P.S. If someone can offer me hosting space I am more than happy to publish all the files. I just don't want to upload them to Google or my uni server as they contain quite rude words ;)

2 comments:

Mex said...

I am not sure if this is disturbing or brilliant; useful work nonetheless.

Free Porn said...

I never would have guessed "in" would be number 1, and I am very surprised "the" is quite far down the list.