Extracting Data from Web Pages with FinpMod

FinpMod is a Perl module I use to help with scripts that access a variety of web sites from which I want to retrieve numeric information. In most cases the sites require a login with at least a username and password. The HTML is laced with pretty pictures and advertising that get repetitive, and it takes considerable mousing around when I want the results every day or every week. Instead I write a script and set it up as a cron job so that it runs at night and produces tab-separated text files that open easily in a spreadsheet, MS Excel on Mac OS 9 for me.

The login sequence for the sites is typically set up with privately-coded procedures that can be strange indeed. Basic HTTP login is never used. Rather, some combination of HTTPS POST requests and cookies is used. Often a request results in a relocation (a redirect) while simultaneously delivering a cookie that is easily missed. The FinpMod module allows for a "debugging" mode that helps by saving all downloaded files and headers to temporary files as opposed to simply reading them into RAM. While scripting the HTTP access it helps to be able to open and observe the intermediate files with a text editor.

The Live HTTP headers add-in for Firefox or the logfile from the iCab browser for Macintosh is typically used as a starting point for writing a script in Perl. The script then makes system calls executing curl to access the internet. It would be more efficient to make calls to the curl library from compiled code, but the all-scripting approach makes development easier and simplifies the changes required when the targeted web site is changed by others.

FinpMod does not help you to access sites for which you are not an authorized customer. There is no password cracking. You need to have an assigned username, password, and whatever else you need from another source.

FinpMod has been developed on a Macintosh using Mac OS X 10.3.9. I use BBEdit from Bare Bones Software as an editor and as a worksheet from which I can execute shell commands. Some of the debugging procedures depend on BBEdit but they should be easily changed to fit your own development environment. In principle all you need is a terminal window and a text editor, but an editor that can execute shell commands is nice. The tested and working scripts are regularly moved over onto a Linux box for execution under cron.

Your scripts need to find FinpMod.pm and that can be painful. If you need two copies of curl or perl, one for your system, which may not be current, and another for FinpMod, it's not clear that you can put FinpMod in the standard library places. I define the environment variable PERL5LIB and point it to a directory where I store my personal modules. A line like:

PERL5LIB=$HOME/perl

can be included in a crontab to establish the same linkage and make the perl command "use FinpMod" work in a script.
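
If setting the environment is inconvenient, a script can add the directory itself. The following is a minimal sketch, assuming personal modules live in $HOME/perl as in the PERL5LIB line; the variable name $moddir is only for illustration.

```perl
use strict;
use warnings;

# Assumption: personal modules are kept in $HOME/perl.
# The BEGIN block runs at compile time so the value is ready for 'use lib'.
my $moddir;
BEGIN { $moddir = (defined $ENV{HOME} ? $ENV{HOME} : '.') . '/perl' }

# 'use lib' puts the directory at the front of @INC, so a later
# "use FinpMod;" would search it first.
use lib $moddir;

# Show where perl will look for modules.
print "$_\n" for @INC;
```

The environment variable has the advantage that the scripts themselves stay unchanged when they move between machines.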

Links

curl, a tool for accessing HTTP, FTP and other protocols.

perl, a scripting language.

Programming Perl, the Camel book.

Format for the .netrc file

Format for a MIME-compliant TSV file

Live HTTP headers for the Firefox browser

The iCab browser for the Macintosh

Bare Bones Software and the BBEdit worksheet editor for Macintosh

Download the FinpMod.pm module

FTP directory for example code and files

About TSV files

Tab separated values files are widely supported text-only files that spreadsheets handle well. A single tab character is used to declare a transition horizontally between columns and a line end terminates a row. Files should end with a line end character. Line ends should be in the style used by the machine on which the file is stored but most current software doesn't care. Line ends and tabs are forbidden inside of a cell of data. Some software attempts to allow such things by quoting the contents of a cell but FinpMod doesn't support that.
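
The rules above can be sketched in a few lines of Perl. This is an illustration of the format only, not code taken from FinpMod; the names tsv_split and tsv_join are made up for the example.

```perl
use strict;
use warnings;

# Split one TSV line into cells. The -1 limit keeps trailing empty cells,
# which split would otherwise drop.
sub tsv_split {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;        # tolerate any style of line end
    return split /\t/, $line, -1;
}

# Join cells back into one TSV line, refusing tabs or line ends inside a cell.
sub tsv_join {
    my @cells = @_;
    for my $cell (@cells) {
        die "tab or line end inside a cell\n" if $cell =~ /[\t\r\n]/;
    }
    return join("\t", @cells) . "\n";
}

my @cells = tsv_split("IBM\t123.45\t\t\n");
print scalar(@cells), " cells\n";   # the count includes the empty trailing cells
print tsv_join(@cells);
```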

TSV allows for a first row that contains column titles which describe the data below. FinpMod uses titles as a means of communication between client scripts and the FinpMod.pm module. Clients need to know what the titles are to be able to change or create data to be saved as a result. Titles can also have shell-like options consisting of a dash (-) and a single letter. They mark particular columns as special, to be used for internal generation of hash keys, sorting, and looking for long titles associated with web pages. In FinpMod titles are limited to alphabetic characters and the underline (_) with no spaces or numbers. The column cardinal (zero based) can be accessed by a client using a $ sign followed by the text of the column title with any -option removed.
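
A title row of this kind might be parsed as follows. This is a sketch under assumptions: the option is taken to be written directly after the title text (for example Symbol-t), the sample titles are invented, and parse_titles is a hypothetical name rather than a FinpMod function.

```perl
use strict;
use warnings;

# Assumption: an option is written directly after the title, e.g. "Symbol-t".
sub parse_titles {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;
    my (%column, %option);
    my $index = 0;
    for my $cell (split /\t/, $line, -1) {
        # Letters and underline only, then an optional dash and single letter.
        my ($title, $opt) = $cell =~ /\A([A-Za-z_]+)(?:-([A-Za-z]))?\z/
            or return;                      # not a title row
        $column{$title} = $index;           # zero-based column cardinal
        $option{$opt}   = $index if defined $opt;
        $index++;
    }
    return (\%column, \%option);
}

my ($column, $option) = parse_titles("Acct-a\tSymbol-t\tPrice\tLong_name\n");
print "Price is column $column->{Price}\n";
print "The -t option marks column $option->{t}\n";
```

Because data cells usually contain digits or punctuation, a row that fails the pattern can be treated as data rather than titles.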

FinpMod uses TSV files for data input, data output and some control. When you have a list of ticker symbols or fund descriptions you want to check on, you include the information in a TSV file. When your client script recovers the data it puts it into that same TSV file, which is then read by, probably, your spreadsheet. If you are getting all of the data from a site you still may want a TSV file for input just to declare the column titles to which answers are to be directed.

About the .netrc file

# machine   host.domain.com   login   myself   password   secret   account   string
machine   pair.com   login   dan   password   dan_dan   account   1234
machine   vanguard.com   login   dan   password   myvan   account   tutankhamen
machine   198.66.55.53   login   srtinc   password   ABC123XYZ
machine   srt-inc.com   login   srtinc   password   ABC123XYZ
machine   Saturn   login   Earth   password   Earth

You really should make sure the permissions are set to 600. That's read and write by owner only.
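
A script can check the permissions itself. The sketch below assumes a Unix-style file system and looks for .netrc in $HOME; is_private is a made-up helper name.

```perl
use strict;
use warnings;

# Return 1 when a file grants no access to group or others (e.g. mode 600).
sub is_private {
    my ($path) = @_;
    my @st = stat $path or return 0;      # missing or unreadable file
    return (($st[2] & 0077) == 0) ? 1 : 0;
}

my $netrc = ($ENV{HOME} // '.') . '/.netrc';
if (-e $netrc && !is_private($netrc)) {
    print "Consider: chmod 600 $netrc\n";
}
```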

The Application Programming Interface

sub FMinitialize ()

We call FMinitialize to prepare FinpMod to accept later calls. It checks to see what machine it is running on and sets four directory variables for storage of temporary and permanent results. It also sets up a path to curl, which need not be the system's version. It is possible to have initialization code executed simply by being in the module. We use a specific initialization call because it allows one Perl script to execute other scripts while retaining the ability to execute those independently. More when you look at some examples.

FMinitialize also sets up a log file to which you can write. The first argument in the execution string is checked at compile time. Debugging preferences are entered there.

sub FMdprint ($;$)

FMdprint expects a string as the first argument. A line feed is added and the string is printed to standard output if the debug option is set. Verbosity can be controlled using 0, 1, or 2 as an optional second argument. It is compared with the $FMdebug global, which is initially set by analysis of the first command line argument.

sub FMreport ($)

FMreport expects a string, to which a line end is added, and prints it to the log file. If debugging is enabled the string goes to standard out also.

sub FMnetrc ($)

FMnetrc is a function that accepts a "machine name" and returns a list of strings (username, password, account) which it recovers from a .netrc file (note the leading dot) which needs to be in your home directory. The file format is the same as that used for FTP and by curl. In fact the file is shared if those are in use, though curl, as called by FinpMod, does not use it. The .netrc file should be readable only by the owner but no check is made for that. The machine name can be pretty much anything that contains no white space characters. Use of .netrc makes it possible for me to include sample code that will not contain my private information even by accident.
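
The lookup can be pictured with a short sketch. It is not FinpMod's own parser: it handles only the machine, login, password and account tokens plus # comment lines, and netrc_lookup and the sample entries are invented for the example.

```perl
use strict;
use warnings;

# Look up one machine in .netrc-style text and return (login, password, account).
sub netrc_lookup {
    my ($text, $machine) = @_;
    my @tokens;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*#/;           # skip comment lines
        push @tokens, split ' ', $line;
    }
    my (%entry, $current);
    while (@tokens) {
        my $key = shift @tokens;
        my $val = shift @tokens;
        last unless defined $val;
        if ($key eq 'machine') { $current = $val; next; }
        $entry{$current}{$key} = $val if defined $current;
    }
    my $e = $entry{$machine} or return;     # unknown machine: empty list
    return map { $e->{$_} // '' } qw(login password account);
}

# Made-up entries for illustration.
my $sample = <<'END';
# machine  host  login  user  password  secret  account  string
machine host.example.com login myself password secret account string
machine other.example.com login guest password hunter2
END

my ($user, $pass, $acct) = netrc_lookup($sample, 'host.example.com');
print "$user $acct\n";
```

A real script would read the text from the .netrc file in the home directory instead of a here-document.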

sub FMquit ()

Call FMquit to close out the log file and generally clean up operations. It exits to the calling shell.

sub FMnewtemps ()

FMnewtemps is called with no arguments and returns a two element list containing paths to temporary files which will be used for output from the next requests to the web. First is for the HTTP, probably HTML, text returned. The second is for the headers. The globals $FMpageprefix and $FMheaderprefix are used to provide some control of the names. Each call applies an increasing integer to the prefixes.

sub FMcookieproc ()

FMcookieproc maintains a hash of cookies by examining the header file created by the last call to curl. Actually curl has its own cookie handler but it does not provide for adding cookies when you need to recover them from something other than a set cookie in a returned header. FMcookieproc also notices any location change in a header and stores it in global $FMlocation. Sometimes curl's -L option to follow relocations isn't quite good enough to handle strange security measures.
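
The idea can be sketched as below. This assumes the header file has already been read into a string; scan_headers and the sample headers are hypothetical, and only the name=value part of each cookie is kept.

```perl
use strict;
use warnings;

# Collect cookies and any Location from the text of a saved header file.
# Attributes such as path and expires are ignored in this sketch.
sub scan_headers {
    my ($headers, $cookies) = @_;
    my $location;
    for my $line (split /\r?\n/, $headers) {
        if ($line =~ /^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i) {
            $cookies->{$1} = $2;
        }
        elsif ($line =~ /^Location:\s*(\S+)/i) {
            $location = $1;
        }
    }
    return $location;
}

# Made-up response headers for illustration.
my $sample = join "\r\n",
    'HTTP/1.1 302 Found',
    'Set-Cookie: SESSION=abc123; path=/; secure',
    'Set-Cookie: lang=en',
    'Location: https://www.example.com/home',
    '', '';

my %cookies;
my $location = scan_headers($sample, \%cookies);

# A cookie recovered from somewhere other than a header can simply be added.
$cookies{token} = 'xyz';

my $cookie_string = join '; ', map { "$_=$cookies{$_}" } sort keys %cookies;
print "$cookie_string\n";
print "$location\n";
```

Keeping the cookies in a hash is what makes the manual addition easy; the string form is built only when a request is about to be made.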

sub FMexcurl ($)

FMexcurl accepts a string argument which contains the guts of a command line to curl, but several things are added before the call is made. In particular, the cookie hash is processed and added, destination files for the HTML and header output are assigned, curl is asked to deliver the size of the download, and error messages are discarded. It is possible to add curl options that don't change by setting the global $FMcurlextras. In many cases the command line requires quoted arguments that need to be escaped in the call to FMexcurl. See some examples.

FMexcurl returns a two element list containing the size of the download and the exit status from curl as numbers. Comparing the size of the download with the value expected allows for early termination of a login sequence when things just aren't working. FMexcurl makes several calls to FMdprint which are pretty much required while developing a script.
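
How such a command line might be assembled is sketched here. The curl flags are standard ones (-s, -b, -o, -D, -w), but build_curl, the file paths and the URL are assumptions for illustration, and the actual string FinpMod builds may differ.

```perl
use strict;
use warnings;

# Assemble a curl command line:
#   -s silent, -b cookies, -o body file, -D header file, -w write-out format.
sub build_curl {
    my ($curl, $guts, $cookies, $pagefile, $headerfile) = @_;
    my $cookie_string = join '; ', map { "$_=$cookies->{$_}" } sort keys %$cookies;
    my @parts = ($curl, '-s');
    push @parts, "-b '$cookie_string'" if length $cookie_string;
    push @parts, "-o '$pagefile'", "-D '$headerfile'", "-w '%{size_download}'";
    push @parts, $guts, '2>/dev/null';      # discard error messages
    return join ' ', @parts;
}

my $cmd = build_curl(
    'curl',
    q{-d 'user=me' 'https://www.example.com/login'},   # hypothetical request
    { SESSION => 'abc123' },
    '/tmp/page1', '/tmp/header1',
);
print "$cmd\n";

# To run it, capture the size that -w prints and the exit status:
#   my $size   = `$cmd`;
#   my $status = $? >> 8;
```

Cookie values containing a single quote would break the shell quoting used here, so a real script needs to escape them.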

sub FMsettsv ($;$)

FMsettsv requires a first argument that gives the name of a tab separated file which is always for output but may be for input also. Directories are prepended for the machine in use. If a second argument is present it is for a backup of the file in the first argument. The actual backup file is created on the next call to FMinput.

A call to FMsettsv assumes that a new web site is next and the cookie hash and some other things are reset.

sub FMsettmp ($$)

FMsettmp accepts two arguments which are the prefixes to be applied to temporary file names. It assumes that a new web site will be next and initializes a few things based on that.

sub FMinput (;$)

FMinput reads the TSV file set up by FMsettsv into a hash %FMtickers. It creates unique keys by concatenating values in two specified columns which are identified in the TSV file by -t and -a options. The -a column is first if it exists and is followed by an underline character before appending the -t (for ticker) column. Globals $FMtickercolumn and $FMacctcolumn contain the column numbers. It turns out that, for my work, I need to allow for multiple occurrences of the same ticker symbol in the first column. Prepending an account identifier with an intervening underline character allows for unique keys and scripts that access multiple sites with the same TSV file.

FMinput() creates a hash of titles which can be used to stuff values into the ticker hash using column names (see FMsetval). For now the titles are restricted to upper or lower case alphabetics and the underline with no white space. FMinput() also defines variables in main:: which can be used in the form $title so that direct entries like these:
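
The key construction can be sketched as follows. The rows and the make_key helper are invented for the example; only the account, underline, ticker order comes from the description above.

```perl
use strict;
use warnings;

# Build the row key: account, underline, ticker; or the ticker alone
# when there is no account column.
sub make_key {
    my ($row, $tickercolumn, $acctcolumn) = @_;
    my $ticker = $row->[$tickercolumn];
    return $ticker unless defined $acctcolumn;
    return $row->[$acctcolumn] . '_' . $ticker;
}

# Illustrative rows: the same ticker held in two accounts.
my @rows = (
    [ 'IRA',     'VFINX', '0' ],
    [ 'Taxable', 'VFINX', '0' ],
);

my %tickers;
$tickers{ make_key($_, 1, 0) } = $_ for @rows;
print join(', ', sort keys %tickers), "\n";
```

Without the account prefix the second row would overwrite the first, which is the problem the -a column is there to solve.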

$FMtickers{$akey}[$title] = $value;

$value = $FMtickers{$akey}[$title];

are possible. There is some magic involved, and scripts that use those variables will require some relaxation of the "use strict" pragma. Also, to protect perl's $a and $b specials, titles must be at least two characters. Titles in the TSV file are restricted to upper and lower case alphabetics and the underline, with no spaces or numbers. FMinput() uses those restrictions to identify a title row.
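
One way such variables could be created is with symbolic references, sketched below. This is an assumption about the mechanism, not FinpMod's actual code, and the titles and data are invented.

```perl
use strict;
use warnings;

my @titles = qw(Symbol Price Long_name);   # illustrative title row

{
    # Creating variables by name needs symbolic references, which
    # "use strict" forbids, so relax it for this block only.
    no strict 'refs';
    for my $i (0 .. $#titles) {
        die "title too short: $titles[$i]\n" if length $titles[$i] < 2;
        ${"main::$titles[$i]"} = $i;       # $main::Price becomes 1, and so on
    }
}

# A client can declare the names it uses with 'our' to satisfy strict 'vars'.
our ($Symbol, $Price);

my %tickers = (VFINX => [ 'VFINX', 0, 'Vanguard 500 Index' ]);
$tickers{VFINX}[$Price] = 123.45;
print "$tickers{VFINX}[$Symbol] $tickers{VFINX}[$Price]\n";
```

Declaring the names with our is an alternative to switching off strict 'vars' for the whole client script.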

sub FMsetval ($$;$)

FMsetval is one way to set values in the %FMtickers hash that will become the output to the TSV file. The first argument is a string that represents the column, by name, in the TSV file. Note that the first row of a TSV file is typically just titles and that FMinput() processes those names for use this way. The second argument is the value to be stored, string or numeric. The third argument is sticky and is the hash key into the row being worked on. If absent the last value found is used.
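
The sticky key behaves roughly like this sketch, which assumes a title-to-column hash like the one FMinput builds. The names set_val, %column and $lastkey are hypothetical.

```perl
use strict;
use warnings;

my %column  = (Symbol => 0, Price => 1, Shares => 2);   # title => column number
my %tickers = (VFINX => [ 'VFINX', '', '' ]);
my $lastkey;                                             # remembered between calls

# Store a value by column title. The key is optional after the first call.
sub set_val {
    my ($title, $value, $key) = @_;
    $lastkey = $key if defined $key;
    die "no row key given yet\n"   unless defined $lastkey;
    die "unknown title '$title'\n" unless exists $column{$title};
    $tickers{$lastkey}[ $column{$title} ] = $value;
    return;
}

set_val('Price',  123.45, 'VFINX');   # names the row
set_val('Shares', 10);                # reuses VFINX
print join("\t", @{ $tickers{VFINX} }), "\n";
```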

sub FMputtsv (;\@)

FMputtsv writes the ticker hash to the output file which will overwrite the input. If an argument is present it is a reference to a list of keys into %FMtickers. About the only reason for that is to sort by something other than the keys themselves. The function prototype allows a call like FMputtsv(@newkeys);.
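
The effect of the (;\@) prototype is shown in this sketch. The sub put_rows and the data are invented; it returns the text rather than writing a file so that the ordering is easy to see.

```perl
use strict;
use warnings;

my %tickers = (
    IRA_VFINX     => [ 'IRA',     'VFINX', 200 ],
    Taxable_VBMFX => [ 'Taxable', 'VBMFX', 100 ],
);

# With the (;\@) prototype a caller writes put_rows(@keys) and the sub
# receives a reference to @keys. The sub must be declared before the call.
sub put_rows (;\@) {
    my ($keys) = @_;
    my @order = $keys ? @$keys : sort keys %tickers;
    return join '', map { join("\t", @{ $tickers{$_} }) . "\n" } @order;
}

# Order rows by the third column instead of by key.
my @newkeys = sort { $tickers{$a}[2] <=> $tickers{$b}[2] } keys %tickers;
print put_rows(@newkeys);

# With no argument the rows come out in key order.
print put_rows();
```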

sub FMpublish ()

FMpublish uses a call to curl to send the modified TSV file off to a server. You'll probably have to modify it because I doubt if you're using an old Mac SE/30 for your server. But then you don't need to call it at all if the running machine is the server.

sub FMlookfor($)

FMlookfor takes an argument which is an account identifier as a string. It creates a hash of things to look for in a web page. Quite often a web site refuses to use stock-like ticker symbols and uses complicated names instead. The hash created, %FMlookhash, is indexed by the long names and contains keys into %FMtickers as values. The items to be looked for come from long names in the input TSV file.

As of today there is a problem. The column numbers in the TSV file are hard coded and I'm working on that. I would really like to use a fuzzy comparison of long names to account for sloppy abbreviations often found in web pages. See some examples for usage of this hash while examining a web site for references.
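
A sketch of the lookup hash, with the column numbers passed in as arguments, might look like this. The row layout, the look_for helper and the sample page fragment are assumptions, and the match is exact rather than fuzzy.

```perl
use strict;
use warnings;

# Illustrative rows keyed as account_ticker: account, ticker, long name.
my %tickers = (
    IRA_VFINX     => [ 'IRA',     'VFINX', 'Vanguard 500 Index Fund' ],
    IRA_VBMFX     => [ 'IRA',     'VBMFX', 'Vanguard Total Bond Market Index Fund' ],
    Taxable_VFINX => [ 'Taxable', 'VFINX', 'Vanguard 500 Index Fund' ],
);

# Map long names to row keys for one account.
sub look_for {
    my ($account, $acctcol, $namecol) = @_;
    my %look;
    for my $key (keys %tickers) {
        my $row = $tickers{$key};
        next unless $row->[$acctcol] eq $account;
        $look{ $row->[$namecol] } = $key;
    }
    return %look;
}

my %lookhash = look_for('IRA', 0, 2);

# Report which rows are mentioned in a (made-up) page fragment.
my $page = '<td>Vanguard 500 Index Fund</td><td>123.45</td>';
for my $name (sort keys %lookhash) {
    print "$lookhash{$name}\n" if index($page, $name) >= 0;
}
```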

sub FMignore($)

FMignore returns true (1) if there are no command line arguments other than a -option as the first. Other arguments are account identifiers. If any are present, FMignore returns false (0) unless its argument is present. The usage is usually for debugging but can be helpful for executing from a cron entry that does different things at different times with the same script.

sub FMread ($\$)

FMread is a service routine to read an entire web page from a temporary file into memory. The first argument is the path to the file to be read. The second is a reference to the place it is to be stored. Because of the prototype, pass just $string, not \$string, in your call. The return is the size of the file or zero in case of an error. The first argument is usually returned from FMnewtemps in an earlier call.
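
A routine with the same calling convention can be sketched as follows. It is not FinpMod's code; read_page is a made-up name and the temporary file exists only to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file into the caller's scalar. The ($\$) prototype lets the
# caller pass $page directly while the sub receives a reference to it.
sub read_page ($\$) {
    my ($path, $ref) = @_;
    open my $fh, '<', $path or return 0;
    local $/;                       # slurp mode: no record separator
    $$ref = <$fh>;
    close $fh;
    $$ref = '' unless defined $$ref;
    return length $$ref;
}

# Write a small stand-in for a downloaded page, then read it back.
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
print {$tmp} "<html><body>hello</body></html>\n";
close $tmp;

my $page;
my $size = read_page($tmpname, $page);
print "$size bytes\n";
```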

sub FMtabler (\$)

FMtabler is a service to convert a string that is a full downloaded HTML document into just its tables. The argument is a reference to a string being worked on. Because of the prototype, pass just $string, not \$string, in your call. A bunch of regular expressions are applied to remove useless junk and leave just the table markup. A call to FMobserver is included so that, while debugging, you can open a file in an editor to see what happened.
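
The general approach can be sketched with a few regular expressions. The particular expressions here are assumptions, not the ones FinpMod applies, and they only cope with simple, non-nested tables; keep_tables and the sample HTML are invented.

```perl
use strict;
use warnings;

# Reduce an HTML string, in place, to its <table>...</table> sections.
sub keep_tables (\$) {
    my ($ref) = @_;
    my @tables = $$ref =~ m{(<table\b.*?</table\s*>)}gis;
    for (@tables) {
        s{<(script|style)\b.*?</\1\s*>}{}gis;   # drop embedded scripts and styles
        s{<img\b[^>]*>}{}gi;                    # drop images
        s{[ \t]+}{ }g;                          # squeeze runs of blanks
    }
    $$ref = join "\n", @tables;
    return scalar @tables;
}

my $html = '<html><body><img src="ad.gif"><p>Buy now!</p>'
         . '<table><tr><td>VFINX</td><td>123.45</td></tr></table>'
         . '<p>footer</p></body></html>';

my $count = keep_tables($html);
print "$count table(s)\n$html\n";
```

Regular expressions are fragile against arbitrary HTML, but for a single known site whose layout you have inspected they are usually enough.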

sub FMobserver (\$)

FMobserver prints a string to a temporary file and an appropriate command for viewing by an editor to standard out. It is useful while scripting extraction of table items for inclusion in the TSV file.

sub FMrowcol ($\@)

FMrowcol pretty-prints rows and columns to standard out for debugging. See some examples for its use.

sub FMtimelocal (@)

FMtimelocal returns an epoch time in seconds from an input list in the form used by localtime() (Camel book page 738). It's a little different from timelocal() in the CPAN time module. Here it is useful for such things as telling your bank the time at which you want transaction results to begin. Subtract the seconds in a month from now, for instance. It's possible for something to go wrong, in which case a negative value is returned.
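
A comparable conversion can be sketched with timelocal() from the Time::Local module. The wrapper epoch_from_list is hypothetical; it returns -1 on bad input to mirror the negative-on-error convention, which may not match FMtimelocal exactly.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Convert a localtime()-style list to epoch seconds.
# timelocal() dies on out-of-range fields, so trap that and return -1.
sub epoch_from_list {
    my @lt = @_;
    my $epoch = eval { timelocal(@lt[0 .. 5]) };
    return defined $epoch ? $epoch : -1;
}

# About a month ago, taking a month as 30 days.
my $now   = time;
my $start = epoch_from_list(localtime($now - 30 * 24 * 60 * 60));
print scalar(localtime $start), "\n";

# Day 32 of January is invalid, so the error value comes back.
print epoch_from_list(0, 0, 0, 32, 0, 2020), "\n";
```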

Contact

Kudos, suggestions, bug reports, and sample client scripts are encouraged. Please clean out any usernames and passwords.

Douglas P. McNutt, PhD
The MacNauchtan Laboratory
7255 Suntide Place
Colorado Springs, CO 80919-1060
voice 719 593 8192
dmcnutt@macnauchtan.com  -  Please put FinpMod somewhere in the subject line.
http://www.macnauchtan.com/
ftp://ftp.macnauchtan.com/