Jobhunt

Jobhunt is a simple Python script that will query sites for employers you are interested in and collect the results, emailing them to you. You will almost certainly have to write some plugins to pull the information off the sites you are interested in. (Plugins currently exist for job-listing websites of community colleges in CA and OR.)

Installation

Jobhunt has only two dependencies: Requests (>= 2.6.2), which can be installed with

pip install --upgrade requests

and BeautifulSoup:

pip install beautifulsoup4
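
To confirm that both libraries are importable, a quick sanity check (not part of Jobhunt itself) is:

python -c "import requests, bs4; print(requests.__version__, bs4.__version__)"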

TODO List

Usage

Edit the configuration file jobhunt.conf with the list of plugins, URLs, and your email account details.

Run ./jobhunt.py from cron. To test, you can run

./jobhunt.py --test
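
For unattended use, a crontab entry along these lines should work (the schedule and paths are placeholders):

0 8 * * * /path/to/jobhunt.py --conf /path/to/jobhunt.conf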

If you want to test a specific plugin or URL, you can do

./jobhunt.py --plugin pluginname
./jobhunt.py --plugin pluginname --url http://example.com

Note that for offline testing of plugins, the URL can start with the file:/// scheme.
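
For example, to exercise a plugin against a page saved to disk (the plugin name and path are placeholders):

./jobhunt.py --test --plugin pluginname --url file:///home/you/saved-listing.html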

If you just want to test the email configuration by sending a test message, do

./jobhunt.py --mailtest

The full set of command-line options is:

--conf path                 Specify location of the config file

--data path                 Specify location of the persistent data file
                            (defaults to same directory as config file)

--test                      Run, but output to console and don't email

--testhtml                  Run, but output HTML to console instead of 
                            plain text.

--ignorekw                  Ignore any keywords (return all results)

--ignoredb                  Don't read/write to the database. All results
                            will be returned as "new".

--forcedb                   Use the database, even when it would otherwise
                            not be used (i.e., in test mode)

--plugin pluginname         Run only those entries that use a specific
                            plugin.

--url http://whatever       Run only a specific URL's entry (from config)

--mailtest                  Send a test email, to test mail settings
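
For instance, a debugging run that checks a single plugin without touching the database or sending mail might look like this (the config path and plugin name are placeholders):

./jobhunt.py --conf /path/to/jobhunt.conf --test --ignoredb --plugin pluginname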

Configuration File

The Jobhunt configuration file is a JSON object with the following fields:

Keyword Matching

In the configuration file, each entry under "keywords" has a field name as its key and a set of keyword criteria as its value. The special field name "*" can be used to match against any field.

A keyword criteria object has the form:

{ "all" : [keywords...], "any" : [keywords_or_lists...] }

Both fields are optional, and are treated as [] if they are missing.

The matching procedure requires at least one field to match in order for the whole match to succeed. The special '*' field does not count toward this requirement, since it must always match anyway.

Thus, the way to set up keywords is:

For a concrete example, here is a keyword-set that requires all matches to mention the keyword 'faculty' somewhere, and then requires the title to mention either 'programming', 'computer' and 'science', or 'information' and 'systems':

"keywords" : {
    "*" : { "all" : ["faculty"] },
    "title" : { "any" : [
        "programming",
        ["computer", "science"],
        ["information", "systems"]
    ]}
}
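
To make the rules above concrete, here is an illustrative Python sketch of the matching semantics as described. This is not Jobhunt's actual implementation, and details such as case-insensitive substring matching are assumptions:

def criteria_match(text, crit):
    # "all": every keyword must appear; "any": at least one entry must match,
    # where an entry is a single keyword or a list of keywords that must all appear.
    text = text.lower()
    if not all(kw.lower() in text for kw in crit.get("all", [])):
        return False
    any_entries = crit.get("any", [])
    def entry_ok(entry):
        kws = [entry] if isinstance(entry, str) else entry
        return all(kw.lower() in text for kw in kws)
    return not any_entries or any(entry_ok(e) for e in any_entries)

def job_matches(job, keywords):
    # job: mapping of field name -> text. The "*" criteria must hold somewhere,
    # but at least one ordinary field must also match for the job to be kept.
    star = keywords.get("*")
    if star and not any(criteria_match(text, star) for text in job.values()):
        return False
    return any(criteria_match(job.get(field, ""), crit)
               for field, crit in keywords.items() if field != "*")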

Plugins

A plugin is just a module which exports a function query(url,keywords). You can import the jobhunt module for some convenience functions (in particular, getURL will download the URL for you, taking care of generating an error report if the status isn't OK). You'll probably want to use BeautifulSoup to parse the HTML. You should try to do three things:

Note that plugins for specific sites are free to ignore the URL. This is particularly useful for sites that require you to request multiple pages (e.g., where the job listing results are divided into pages) or that require you to POST some data in order to get the actual results. The URL is just for convenience, in the case that multiple sites happen to use the same job listing back-end. (In this case, even if extra work is required to retrieve the results, you should still try to use the base part of the given URL, rather than hard-code your plugin to a particular site.) Note, however, that if you don't use getURL to download the page, you have to check yourself for 404 or other HTTP errors, and throw a NotFound exception if one occurs.

A note on Unicode: If you're going to use BeautifulSoup to parse the results of a request, you should pass the .content of the response, not the .text, to BeautifulSoup. BeautifulSoup will do a better job of detecting the encoding, since it can look at the meta-encoding tag. You should also use BeautifulSoup's .get_text() method to extract snippets of the HTML, since it will return a Unicode string with the correct encoding.
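
Putting the notes above together, a minimal plugin skeleton might look something like the following. It is only a sketch: the result format, the CSS selector, and what (if anything) query() should do with the keywords argument are assumptions, since they aren't spelled out here.

import jobhunt
from bs4 import BeautifulSoup

def query(url, keywords):
    # getURL takes care of reporting non-OK HTTP statuses (see above); treating
    # its return value as a requests Response object is an assumption.
    resp = jobhunt.getURL(url)
    # Pass .content (bytes), not .text, so BeautifulSoup can sniff the encoding.
    soup = BeautifulSoup(resp.content, "html.parser")
    results = []
    # The selector and the fields of each result dict below are placeholders;
    # adjust them to whatever the target site and the mailer actually expect.
    for row in soup.select("div.job-listing"):
        results.append({
            "title": row.get_text(strip=True),
            "url": url,
        })
    return results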

A Note About Parsers

BeautifulSoup will use the "best" available parser, if none is specified. This means that the results of the various plugins may depend on which parsers you have installed. To the best of my knowledge, only one plugin (peopleadmin) strictly depends on a particular parser (lxml), and it requests it explicitly, so it will fail if you don't have it installed.

Still, if you write your own plugins, this is something to think about. You might want to test your plugin with the various parser options, to see if it will give the same results in all cases.
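
For example, a plugin can pin a parser by naming it explicitly when constructing the soup:

from bs4 import BeautifulSoup

html = "<ul><li>Example listing</li></ul>"
soup_stdlib = BeautifulSoup(html, "html.parser")  # stdlib parser, always available
soup_lxml = BeautifulSoup(html, "lxml")           # requires the lxml package to be installed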