Research Reflection - Inferring YouTube Video Popularity
Students: Michael Moriarity, Joe Hof
Research Reflection
This semester, I performed research under Dr. Striegel alongside fellow undergraduate computer science major Joe Hof. Joe and I decided to join Dr. Striegel's developing research on video websites such as YouTube. The research question was whether the videos posted on these websites have something in common and, if so, how the video data could be cached to free up bandwidth. Social bookmarking websites such as Digg and Slashdot often link to a YouTube video. If that video is posted on a smaller website with a bandwidth limit, making the front page of Digg or Slashdot can use up all of the site's bandwidth and force the site to shut down. Dr. Striegel's research aims to prevent such events using packet caching and similar techniques.
Our first task, as designated by Dr. Striegel, was to develop a Perl script that grabbed data from YouTube. The data to grab was:
- The video's name.
- The number of views.
- The URL of the video.
- The length of the video.
Joe and I decided it would be most beneficial to the learning process if we both sat down and programmed on one computer, rather than each writing the same program separately. We sat down with a few books from the library and began to write our first Perl script.
Our first goal was to figure out how to grab the page source from YouTube itself. We first tried "wget", but it didn't like the "&" in the URL. We then tried the simpler "get" function, which worked wonderfully. We now had the source code for YouTube's top videos of the day in an array. Next we worked on writing the array to a file. This is where the code gets a little sloppy. We tried using just one file (test.txt) for all of our file operations, but Perl didn't like the way we were doing things, so we resorted to opening, closing, and reopening files to get the functionality we wanted. We then wrote the source code to "test.txt" for further processing.
The next step was to extract from "test.txt" only the lines of source code that contained data useful to us. To do this, we opened a new file titled "file.txt" and wrote to it the lines of "test.txt" that contained the words "vtitle", "Views:", or "runtime". In hindsight, we could have named our files a little better, but it's not crucial. We then had the lines of code containing the desired information in "file.txt".
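The filtering step described above can be sketched roughly as follows. This is a minimal illustration rather than our original script: the sample page lines and the helper name keep_useful_lines are invented, it works on an in-memory array instead of the two files, and only the three marker words come from the text.

```perl
use strict;
use warnings;

# Marker words named in the text.
my @markers = ('vtitle', 'Views:', 'runtime');

# Return only the lines that contain at least one marker word.
# (keep_useful_lines is a hypothetical helper name.)
sub keep_useful_lines {
    my @lines = @_;
    my $pattern = join '|', map { quotemeta($_) } @markers;
    return grep { /$pattern/ } @lines;
}

# Invented sample lines standing in for the downloaded page source.
my @page = (
    '<div class="header">Top Videos</div>',
    '<div class="vtitle"><a href="/watch?v=abcdefghijk">Sample clip</a></div>',
    '<span class="runtime">02:15</span>',
    '<span>Views: 1,234</span>',
    '<div class="footer">About</div>',
);

my @useful = keep_useful_lines(@page);
print "$_\n" for @useful;
```

With 20 videos and three matching lines per video, a filter like this would leave 60 lines to process.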
Next it was time to perform surgery and extract all the data that we needed from these 60 lines of code. We needed to learn regular expressions. By looking through books as well as examples from the internet, we pieced together some regular expressions which extracted the desired data from "file.txt" and placed it in the appropriate data arrays. Here are the regular expressions:
push(@vid_links,$1) while /([=][0-9a-zA-Z\-\_]{11,})/g;
push(@vid_names,$1) while /([>][\w\s\W]{1,}[<])/g;
To create these lines, we looked at the source code and identified unique separators that set the data we wanted apart from the rest of the line. For example, the video identifier in the URL always followed an equals sign and was at least 11 characters long. Similarly, the video's name, views, and runtime each sat between angle brackets (">" and "<") and were of variable length.
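To show how these separators behave, here is a small sketch run against two invented lines (the real YouTube markup is not reproduced in this report). The first pattern is the one quoted above. The second is a tighter variant written for this illustration, not the original expression: it captures only the text between ">" and "<", without the brackets themselves. The array name @vid_text is also hypothetical.

```perl
use strict;
use warnings;

# Invented sample lines; assumed to resemble the filtered page source.
my @lines = (
    '<a class="vtitle" href="/watch?v=abcdefghijk">Sample clip</a>',
    '<span class="runtime">02:15</span>',
);

my (@vid_links, @vid_text);
for (@lines) {
    # Pattern from the text: an equals sign followed by 11 or more ID characters.
    push(@vid_links, $1) while /([=][0-9a-zA-Z\-\_]{11,})/g;

    # Tighter variant (an assumption, not the original): only the text
    # between a closing ">" and the next "<" is captured.
    push(@vid_text, $1) while />([^<>]+)</g;
}

print "link: $_\n" for @vid_links;
print "text: $_\n" for @vid_text;
```

Note that the quoted link pattern keeps the leading "=" in the captured value, which matches the way the video identifiers are written later in this report.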
We then had to grab the name, runtime, and views of each video and store them in an array. We did this with a simple counter that we incremented every time we added something to the array. Because each video's name went into the array first, then its runtime, and then its number of views, we could take the values out in that order and label them. We now had all the data we needed in two arrays: vid_links and vid_goodies. At first we wrote the data to a statically named file, but we realized this wouldn't work for data collection because our program overwrote the file every time it grabbed the information. We then decided on a naming convention for the output file: "Youtube - $year-$mon-$mday $hour:$min:$sec", where $year, $mon, and so on hold the current date and time. This creates a new file every time data is collected and labels it with the time at which the collection occurred.
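The counter-based labeling and the timestamped file name can be sketched as below. This is an illustration under assumptions, not the original code: the helper names label_triples and timestamped_name are invented, and the sketch uses strftime from the core POSIX module instead of separate $year and $mon variables, since Perl's localtime returns the year as an offset from 1900 and the month counted from 0.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Turn a flat list (name, runtime, views, name, runtime, views, ...)
# into one labeled record per video, using a running counter.
sub label_triples {
    my @flat   = @_;
    my @fields = ('name', 'runtime', 'views');
    my @records;
    my $count = 0;
    for my $value (@flat) {
        push @records, {} if $count % 3 == 0;       # start a new video
        $records[-1]{ $fields[$count % 3] } = $value;
        $count++;
    }
    return @records;
}

# Build a file name such as "Youtube - 2007-03-01 14:05:09".
# $epoch is optional and defaults to the current time.
sub timestamped_name {
    my ($prefix, $epoch) = @_;
    $epoch = time unless defined $epoch;
    return $prefix . ' - ' . strftime('%Y-%m-%d %H:%M:%S', localtime($epoch));
}

# Invented sample values.
my @records = label_triples('Sample clip', '02:15', '1,234',
                            'Other clip',  '00:30', '56');
print "$_->{name}: $_->{runtime}, $_->{views} views\n" for @records;
print timestamped_name('Youtube'), "\n";
```

One caveat with this naming scheme is that colons are not allowed in file names on Windows, so a different separator would be needed there.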
This was our progress up to spring break. The next addition to the script counted how many sites link to each video, in order to look for a trend between views on YouTube and links from other sites. We first tried to grab Google's source code using the "get" function, but Google didn't allow us to do that. The alternative was to use another search engine such as Yahoo. We discovered that the way to search for a YouTube video was to insert the end of its URL into the Yahoo search URL. For example, if we wanted to search for the video with the URL ending in "=***********", we would get the source code for the URL:
http://search.yahoo.com/search?p=v%3D***********&fr=yfp-t-501&toggle=1&cop=mss&ei=UTF-8
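A rough sketch of how such a search URL can be assembled for each video is shown here. The helper name yahoo_search_url and the sample identifiers are invented, and the sketch assumes the identifier is passed without its leading "=", since the "%3D" in the query string is already the URL-encoded form of "=". Fetching the page is left out.

```perl
use strict;
use warnings;

# Build the Yahoo search URL for one video identifier.
# "%3D" is the URL-encoded "=", so the query searches for "v=<id>".
sub yahoo_search_url {
    my ($video_id) = @_;
    return 'http://search.yahoo.com/search?p=v%3D' . $video_id
         . '&fr=yfp-t-501&toggle=1&cop=mss&ei=UTF-8';
}

# Hypothetical identifiers; the real script would use the collected ones.
my @video_ids = ('abcdefghijk', 'ABCDEFGHIJ_');
my @urls = map { yahoo_search_url($_) } @video_ids;
print "$_\n" for @urls;
```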
This method worked quite well, and we could grab the source code for every search. We looped 20 times, once for each of the 20 videos, and stored the results in a file, "test2.txt". We then extracted the number of Yahoo hits for each video from the source code using the regular expression:
push(@vid_results,$1) while /([t][0-9\,\s]{1,}[f])/g;
This regular expression isn't exactly what we wanted, because in addition to the number of search results it grabs a letter before and after the number (e.g. "t 760 f"). We decided this wasn't a big deal, because the data is still there. Next, we wrote all of this data to a dynamically named file based on the system time ("Yahoo Results - $year-$mon-$mday $hour:$min:$sec"). So now we have two files for every execution of the program. The program loops once every hour, which means it grabs data from YouTube (which doesn't update on a regular basis) every hour and searches Yahoo for data on the videos it retrieves. One of the other undergraduate students used our data in a C program he wrote, which computes the Rabin fingerprint of each of two videos and then calculates how many bits the two videos have in common.
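The stray letters around the result count could be avoided by moving the capture group so that it covers only the number. The sketch below is a possible refinement rather than what our script does: the helper name result_counts and the sample text are invented, and it assumes the count is always preceded by a "t" and followed by an "f", as in the "t 760 f" example.

```perl
use strict;
use warnings;

# Extract result counts as plain integers from text shaped like "t 760 f".
# Only the digits (and commas) sit inside the capture group.
sub result_counts {
    my ($text) = @_;
    my @counts;
    while ($text =~ /t\s*([0-9][0-9,]*)\s*f/g) {
        (my $n = $1) =~ tr/,//d;    # drop thousands separators
        push @counts, $n;
    }
    return @counts;
}

# Invented sample text standing in for the search result pages.
my @vid_results = result_counts(
    'of about 1,760 for one video; of about 42 for another'
);
print "$_\n" for @vid_results;
```

Storing plain integers this way would also make it easier to compare the hourly counts numerically later on.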
If we were to continue this research over the summer, I feel like the next step would be to gather more data from more search engines and then somehow graphically represent the data we collect rather than just use text files. This wouldn't be very difficult, in my opinion, and would help the research along by gathering more data and making it more readable.
Overall, I feel the semester was very rewarding. I learned a great deal about scripting in Perl and regular expressions, and gained some interesting insight into YouTube trends. I feel the knowledge I gained through performing this research will help me a lot in my future classes. I sincerely thank Dr. Striegel for guiding us and giving us the opportunity to study under his tutelage.