University of Notre Dame NetScale Laboratory
My initial task at the start of the semester was to continue my research from last year: take the files my previous script had created, read the 20 or so video codes located in each file, and download the corresponding videos into a different directory. Two additional requirements came with it: first, if we already had a video, it should not be downloaded again; second, once a file had been checked, it should be moved into a different directory. After some frustrating searches (searching Google for "perl" and "downloading" is NOT advised), I figured the best way to actually download the files was a nifty little Python script called youtube-dl. This script simply takes the address of the page with the YouTube video as an argument, which is perfect because the pages differ only by the video code located in each file. So once I had the Perl script running the Python script, all I needed to do was parse through the files.
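The idea of turning a video code into a youtube-dl call can be sketched roughly as follows. This is a Python illustration rather than the Perl used in the project, and the URL pattern in `WATCH_URL` and the helper names `build_command` and `download` are assumptions made for the example; the only thing taken from the text above is that youtube-dl receives the page address as its argument.

```python
import subprocess

# Assumed URL pattern: the report only says the pages differ by video code.
WATCH_URL = "http://www.youtube.com/watch?v={code}"


def build_command(code, downloader="youtube-dl"):
    """Return the argument list that would download one video."""
    return [downloader, WATCH_URL.format(code=code)]


def download(code, video_dir):
    """Run the downloader inside video_dir; True means exit status 0."""
    # Passing a list (no shell) avoids any quoting problems.
    result = subprocess.run(build_command(code), cwd=video_dir)
    return result.returncode == 0
```

Calling `build_command("EQbbfP1RBfQ")` only builds the command, so it can be inspected without touching the network.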

This ended up being a bit trickier than I anticipated. At first I thought I would just embed this functionality into my previous script, but I soon realized that a ton of files had already been downloaded and still needed to be parsed, so I would have to create a completely new script. The first step was finding a way to go through a directory and look at each file individually. I soon found the find() function (from Perl's File::Find module), which searches through a specified directory recursively and runs a subroutine for every file found. This turned out to work very well for what I needed, as long as the directory being searched does not contain other folders with other files in them (which keeps the search fast). For every file found by the search, I checked whether it was a text file I wanted to parse by comparing the first seven letters of its name against "YouTube". If it was, I opened the file and pushed all of its video codes onto a global array. I then ran a for loop over that array, checking each code against every file found in the videos directory using another find(). If a matching file was found, I set a flag to true; if not, nothing happened. After the find(), a short if-else either went on to the next iteration of the for loop if the flag was set or downloaded the video if it was not. I then cleared the array (since it is global) so it could be used for the next file that needed to be parsed.
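The scan described above can be sketched as follows, in Python rather than the original Perl, with `os.walk` standing in for find(). Two details are assumptions, because the report does not state them: that each line of a data file begins with the video code, and that a downloaded video's file name contains its code. All function names are hypothetical.

```python
import os


def is_data_file(name):
    """Same test as in the report: first seven letters equal 'YouTube'."""
    return name[:7] == "YouTube"


def read_codes(path):
    """Collect video codes from one data file.

    Assumption: the code is the first whitespace-separated field per line.
    """
    codes = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields:
                codes.append(fields[0])
    return codes


def have_video(code, video_dir):
    """True if some file under video_dir has the code in its name (assumed)."""
    for _root, _dirs, files in os.walk(video_dir):
        if any(code in name for name in files):
            return True
    return False


def missing_codes(text_dir, video_dir):
    """Return the codes that still need downloading, without duplicates."""
    missing = []
    for root, _dirs, files in os.walk(text_dir):
        for name in sorted(files):
            if not is_data_file(name):
                continue
            # A fresh list per file replaces the global array that had to be cleared.
            for code in read_codes(os.path.join(root, name)):
                if code not in missing and not have_video(code, video_dir):
                    missing.append(code)
    return missing
```

Returning a new list from `read_codes` for every file avoids the step of emptying a shared global array between files.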

Lastly, I needed to move the file to another directory. This also turned out to be trickier than I expected. I originally wanted to use a simple command-line mv command wrapped in backticks, but it was giving me trouble for some reason. I then tried the move() function, but when I ran that, it moved every file (even the ones that were not "YouTube" files) to the checked_files directory and removed the text_files directory, the place where they were originally stored. I did some playing around and finally figured out that the variable $_, which is supposed to hold the name of the file found by the find() function, was being lost for some reason. The easiest fix was to create a global variable and assign $_ to it right after the file was found. This solution worked partially, but the `mv` command still did not like it because our filenames included spaces, so after some more digging I found the rename() function, which turned out to work perfectly. I also added a short sleep after every successful video download so that we do not end up pinging YouTube TOO often. All that remains is to let the script run on the massive collection of files and let the downloading begin (and hope that it does not break)!
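A rough Python equivalent of the move step is shown below, not the Perl code itself. It captures the file name in a local variable right away (mirroring the fix for the lost $_) and renames the file directly instead of going through a shell, so spaces in the name need no quoting. The function names and the five-second default pause are assumptions; the report only mentions "a short sleep".

```python
import os
import time


def move_checked(path, checked_dir):
    """Move one parsed file into checked_dir and return its new path."""
    # Capture the name immediately instead of relying on a loop variable.
    name = os.path.basename(path)
    target = os.path.join(checked_dir, name)
    # A direct rename involves no shell, so spaces in the name are harmless.
    os.rename(path, target)
    return target


def polite_pause(seconds=5):
    """Wait between downloads; the default length is an assumed value."""
    time.sleep(seconds)
```

Note that `os.rename` can fail when the source and target directories sit on different filesystems, in which case a copy followed by a delete would be needed.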

At fall break, I had a fully working script that parsed all of the collected files, downloaded any videos not already located in the video directory, and then moved the text file with the YouTube data to another directory. Since the script worked well with the test files, I was then given three more jobs: writing a script that reports directory data for the directory holding all of the videos; creating a separate file for each video that contains every date on which the video was found in a YouTube file, along with the hits it had at that date; and creating plots of this data with gnuplot and putting the resulting images in an HTML file.

The first script was very simple and just used a few straightforward commands from the Linux prompt. The second was slightly more of a challenge. The file itself is formatted into eight columns: year, month, day, hour, minute, second, time in seconds since January 1, 2000, and hits at that time. The idea is that every time the parser detects a certain YouTube code in the file it is parsing, it opens a file named after that code and appends all of this information to it. So once the parser is done, a file should exist for each video (named after its code) containing the hits it had at each date it was found, and a line count of that file shows how many times the video was found in our files.

Lastly, I needed to run all of this data through gnuplot to generate the plots from these files. The easiest way to accomplish this was in the main parse loop: I create a make_images.gnu file and append to it the commands gnuplot needs to plot each file. The one downside is that the parser appends a block for the same code's .dat file to make_images.gnu once for every time that code is found. This is inefficient, but it does not affect the images themselves, because each newly created image simply overwrites the existing image of the same name. The make_images.gnu file itself is formatted as follows:

set xlabel "Time since 2000 in seconds"
set ylabel "Number of hits"
set output "png/EQbbfP1RBfQ.png"
plot 'EQbbfP1RBfQ.dat' using 7:8 t "Time vs. Hits" with linesp

with this block being repeated for every code the parser finds. Then, after the main parsing loop has completed, the script calls the gnuplot command with the make_images.gnu file as input, which creates all of the graphs and puts them in the png/ directory nested within the text_files/ directory. Lastly, the script creates an HTML file called images.html within the png/ directory, which simply displays a table four columns wide containing the thumbnails of all the graphs, each with the code of the video it represents. This concluded the tasks I completed after fall break.
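The eight-column record that feeds the `using 7:8` plot above could be produced along these lines. This is a Python sketch under stated assumptions, not the Perl parser: the columns are taken to be separated by single spaces, timestamps are treated as naive local times, and the names `format_record`, `append_record`, and `times_seen` are hypothetical.

```python
import os
from datetime import datetime

# Reference point for column 7: midnight on January 1, 2000.
EPOCH_2000 = datetime(2000, 1, 1)


def format_record(when, hits):
    """Build one line: year month day hour minute second seconds-since-2000 hits."""
    seconds = int((when - EPOCH_2000).total_seconds())
    return "%d %d %d %d %d %d %d %d" % (
        when.year, when.month, when.day,
        when.hour, when.minute, when.second,
        seconds, hits,
    )


def append_record(data_dir, code, when, hits):
    """Append one observation to the per-video file named <code>.dat."""
    path = os.path.join(data_dir, code + ".dat")
    with open(path, "a") as fh:
        fh.write(format_record(when, hits) + "\n")
    return path


def times_seen(path):
    """The line count equals the number of times the video was found."""
    with open(path) as fh:
        return sum(1 for _ in fh)
```

For example, `format_record(datetime(2000, 1, 2, 0, 0, 1), 42)` yields `2000 1 2 0 0 1 86401 42`, where column 7 is one day plus one second.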

As of right now, there are just two problems I could not figure out and resolve. First, the images of the graphs created by gnuplot are corrupted as soon as gnuplot creates them and cannot be viewed. This is very puzzling because I used the format of the example provided to me by Professor Striegel, yet my graphs still cannot be viewed. Second, gnuplot needs to be run on a machine with some graphical capabilities (because it creates its graphs in a GUI), so my main script cannot be run on a simple box like joemuffaw. Besides these two problems, the scripts have been run and tested multiple times and perform everything else correctly.
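One possible explanation for both problems, untested against this setup, is that the generated script never selects an output terminal, so gnuplot may fall back to its default interactive terminal instead of writing PNG data. The Python sketch below generates the same block with a `set terminal png` line added, and emits only one block per code, which also removes the duplicate blocks mentioned earlier. The `set terminal png` line is an assumption that does not appear in the original make_images.gnu, and the function names are hypothetical.

```python
def gnuplot_block(code):
    """Return the gnuplot commands for one video code."""
    lines = [
        # Assumption: this line is NOT in the original script. It asks
        # gnuplot to write PNG data rather than use its default terminal.
        "set terminal png",
        'set xlabel "Time since 2000 in seconds"',
        'set ylabel "Number of hits"',
        'set output "png/%s.png"' % code,
        "plot '%s.dat' using 7:8 t \"Time vs. Hits\" with linesp" % code,
    ]
    return "\n".join(lines) + "\n"


def build_script(codes):
    """Emit one block per distinct code, keeping first-seen order."""
    seen = set()
    blocks = []
    for code in codes:
        if code in seen:
            continue  # skip repeats so each image is plotted only once
        seen.add(code)
        blocks.append(gnuplot_block(code))
    return "\n".join(blocks)
```

The text returned by `build_script` could be written to make_images.gnu once, after the main parse loop, instead of being appended to during it.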

Attachment                   Size   Date                 Who     Comment
download_video.pl.txt        2.8 K  18 Oct 2007 - 19:23  JoeHof  parser for the YouTube.txt files which downloads videos
download_video_final.pl.txt  6.3 K  22 Dec 2007 - 21:56  JoeHof  final parser for the YouTube files
sizescript                   0.1 K  22 Dec 2007 - 21:50  JoeHof  sizescript for the video directory
r4 - 22 Dec 2007 - 21:56:51 - JoeHof