Jobhunt
Jobhunt is a simple Python script that will query sites for employers you are interested in and collect the results, emailing them to you. You will almost certainly have to write some plugins to pull the information off the sites you are interested in. (Plugins currently exist for job-listing websites of community colleges in CA and OR.)
Installation
Jobhunt has only two dependencies: Requests (>= 2.6.2), which can be installed with
pip install --upgrade requests
and BeautifulSoup:
pip install beautifulsoup4
TODO List
Distinguish part-time from full-time positions.
Attach location (in config) to URLs, so that positions can be sorted by location. Unfortunately, "location" is a common field name in job listings, so we'll have to use something else.
Usage
Edit the configuration file jobhunt.conf with the list of plugins, URLs, and your email account details.
Run ./jobhunt.py from cron. To test, you can run
./jobhunt.py --test
If you want to test a specific plugin or URL, you can do
./jobhunt.py --plugin pluginname
./jobhunt.py --plugin pluginname --url http://example.com
Note that for offline testing of plugins, the URL can start with the file:/// scheme.
If you want to just test the email configuration by sending a test message, do
./jobhunt.py --mailtest
The full set of command-line options is
    --conf path            Specify location of the config file
    --data path            Specify location of the persistent data file (defaults to same directory as config file)
    --test                 Run, but output to console and don't email
    --testhtml             Run, but output HTML to console instead of plain text
    --ignorekw             Ignore any keywords (return all results)
    --ignoredb             Don't read/write to the database. All results will be returned as "new".
    --forcedb              Use the database, even when it would not otherwise be used (i.e., in test mode)
    --plugin pluginname    Run only those entries that use a specific plugin
    --url http://whatever  Run only a specific URL's entry (from config)
    --mailtest             Send a test email, to test mail settings
Configuration File
The Jobhunt configuration file is a JSON object with the following fields:
mail -- Describes the settings for your email server, which is used to send you mail. Should contain the following fields:
    server (required) -- Hostname of the mail server.
    port (optional) -- Port to use to connect. Defaults to 25 or 465 (SSL).
    secure (optional) -- Whether to use SSL to connect (defaults to false).
    username (required) -- Username (in the form user@host.com) to log in as.
    password (required) -- Password to log in with.
    recipient (required) -- Email address to send the report to.
keywords -- An object mapping field names (usually 'title' or 'description') to sets of keywords to search for in that field. See the section below on Keyword Matching for details.
queries -- An object mapping query names (displayed in the report) to query objects. A query object can have the following fields:
    url (required): Specifies the URL to be queried.
    plugin (optional): Specifies the name of the plugin to be used. May be omitted if it can be automatically determined from the URL.
    options (optional): Specifies an object with any other options required by the plugin.
    keywords (optional): If present, completely overrides the global keywords for this query.
    other fields (optional): Will be added to the results of this query, allowing you to attach other identifying information to the query results.
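As a rough illustration of how these fields fit together, the following Python sketch builds a configuration and prints it as JSON. The host name, addresses, password, query name, and URL are placeholders invented for the example; only the field names and the peopleadmin plugin name come from this README. The extra "region" field is likewise hypothetical and simply shows an "other field" being attached to a query.

```python
import json

# All values below are placeholders; only the field names follow the README.
config = {
    "mail": {
        "server": "smtp.example.com",      # required
        "port": 465,                       # optional
        "secure": True,                    # optional, defaults to false
        "username": "user@example.com",    # required
        "password": "change-me",           # required
        "recipient": "me@example.com",     # required
    },
    "keywords": {
        "*": {"all": ["faculty"]},
        "title": {"any": ["programming", ["computer", "science"]]},
    },
    "queries": {
        "Example College": {
            "url": "http://example.com/jobs",  # required
            "plugin": "peopleadmin",           # optional
            "region": "OR",                    # hypothetical extra field
        },
    },
}

# Serialize to the JSON text that would be saved as jobhunt.conf.
config_text = json.dumps(config, indent=2)
print(config_text)
```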
Keyword Matching
In the configuration file, each keyword has a field name as its key, and a set of keyword criteria as its value. The special field name "*" can be used to match on any field.
A keyword criteria object has the form:
{ 'all' : [keywords...], 'any' : [keywords_or_lists...] }
Both fields are optional, and are treated as [] if they are missing.
'all' gives a list of keywords which must all be present in the field in order for this match to succeed. If any of these is missing, the match fails.
'any' gives a list of keywords, or sublists of keywords, at least one of which must be present for the match to succeed. E.g.,
'any' : ['business', 'pleasure']
indicates that either 'business' or 'pleasure' (or both) must occur in the field, while
'any' : ['business', ['funny', 'serious']]
indicates that the field text must contain either 'business' or both of 'funny' and 'serious'.
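The rules for a single criteria object can be sketched as follows. This illustrates the semantics described above and is not the code behind kwMatches; the function names, the case-insensitive substring test, and the treatment of an empty 'any' list as "no constraint" are all assumptions.

```python
def contains(text, keyword):
    """Case-insensitive substring test (an assumption about how keywords match)."""
    return keyword.lower() in text.lower()


def criteria_match(criteria, text):
    """Check one field's text against a {'all': [...], 'any': [...]} object."""
    required = criteria.get("all", [])
    alternatives = criteria.get("any", [])

    # Every keyword under 'all' must be present.
    if not all(contains(text, kw) for kw in required):
        return False

    # An empty or missing 'any' imposes no further constraint (assumed).
    if not alternatives:
        return True

    # Each alternative is a keyword, or a sublist whose members must all be present.
    for alt in alternatives:
        group = [alt] if isinstance(alt, str) else alt
        if all(contains(text, kw) for kw in group):
            return True
    return False


criteria = {"any": ["business", ["funny", "serious"]]}
print(criteria_match(criteria, "A funny but serious talk"))  # True
print(criteria_match(criteria, "A funny talk"))              # False
```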
The matching procedure requires at least one field to match in order for the whole match to succeed. The special '*' field does not count, as it must match in any circumstance.
Thus, the way to set up keywords is:
- Add keywords that you want to find in any field to "*". Note that using 'all' means that all the keywords given must occur in at least one field, whereas using 'any' allows keywords to occur in different fields. (Using 'all' with '*' does not mean that every field must have every keyword.)
- If you know that a specific field, e.g., 'title', will contain a keyword of interest, add criteria for it. If you want to require a set of keywords to be present in that field, use 'all'. If you want to require at least one keyword (or collection of keywords) to be present, use 'any'.
For a concrete example, here is a keyword-set that requires all matches to mention the keyword 'faculty' somewhere, and then requires the title to mention either 'programming', both 'computer' and 'science', or both 'information' and 'systems':
    "keywords" : {
        "*" : { "all" : ["faculty"] },
        "title" : { "any" : [
            "programming",
            ["computer", "science"],
            ["information", "systems"]
        ]}
    }
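To see how this keyword-set would behave on whole listings, here is a small sketch of the matching procedure. It is one possible reading of the rules in this section, not the real implementation: it assumes a listing is a dict of field name to text, that "*" is satisfied when some single field satisfies its criteria, and that when only "*" is configured it decides the match alone. The function names and sample listings are made up for the example.

```python
def field_matches(criteria, text):
    """Compact check of one text against a {'all': [...], 'any': [...]} object."""
    lowered = text.lower()

    def group_present(group):
        # A group is a single keyword or a list of keywords that must all occur.
        words = [group] if isinstance(group, str) else group
        return all(w.lower() in lowered for w in words)

    if not group_present(criteria.get("all", [])):
        return False
    alternatives = criteria.get("any", [])
    return not alternatives or any(group_present(alt) for alt in alternatives)


def listing_matches(keywords, listing):
    """Apply a whole keyword-set to a listing given as {field name: text}."""
    # '*' is mandatory when present: some field must satisfy it (assumed reading).
    star = keywords.get("*")
    if star is not None and not any(field_matches(star, t) for t in listing.values()):
        return False
    # Beyond '*', at least one named field must satisfy its own criteria.
    named = {name: c for name, c in keywords.items() if name != "*"}
    if not named:
        return True  # assumption: with only '*' configured, '*' decides alone
    return any(field_matches(c, listing.get(name, "")) for name, c in named.items())


keywords = {
    "*": {"all": ["faculty"]},
    "title": {"any": ["programming", ["computer", "science"], ["information", "systems"]]},
}
hit = {"title": "Computer Science Instructor", "description": "Full-time faculty position."}
miss = {"title": "Computer Lab Technician", "description": "Supports faculty and students."}
print(listing_matches(keywords, hit))   # True
print(listing_matches(keywords, miss))  # False
```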
Plugins
A plugin is just a module which exports a function query(url,keywords). You can import the jobhunt module for some convenience functions (in particular, getURL will download the URL for you, taking care of generating an error report if the status isn't OK). You'll probably want to use BeautifulSoup to parse the HTML. You should try to do four things:
- Validate the HTML against what you expect it to be. If the target page's format has changed, you want to detect that and return a ParseFailed object, instead of just returning bad data.
- Parse out the job title(s) and description(s). Check them against the keywords given. You can do this easily by just calling Jobhunt.kwMatches(keywords, text). If a page has more than one job listing, check them all and collect the results into a list. Return the list from your query function.
- Return either a single Match object or a list of such objects. The Match constructor takes Match(title,description). You can then use the resulting Match object's .fields to add any other data you've collected from the listing. You can use the .addFields() method to add a whole dict worth of fields at once.
- Try to make sure every bit of text your plugin returns is Unicode (UTF-8). The system will try to clean up after you if you don't, but that doesn't always work.
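Put together, a plugin following these steps might look roughly like the sketch below. To keep it runnable on its own, it defines small stand-ins for Match, ParseFailed, and the keyword check rather than importing them from jobhunt, and it parses a canned page with the standard library's html.parser instead of downloading anything with getURL or using BeautifulSoup. The stand-ins' constructors, the flat keyword list, the "source" field, and the page layout are all assumptions made for illustration; the real helpers may differ.

```python
from html.parser import HTMLParser


class ParseFailed:
    """Stand-in for jobhunt's ParseFailed (the real constructor may differ)."""
    def __init__(self, reason):
        self.reason = reason


class Match:
    """Stand-in for jobhunt's Match(title, description)."""
    def __init__(self, title, description):
        self.title, self.description = title, description
        self.fields = {}

    def addFields(self, extra):
        self.fields.update(extra)


def title_matches(keywords, text):
    """Toy stand-in for the keyword check: any keyword occurs in the text."""
    return any(kw.lower() in text.lower() for kw in keywords)


class ListingParser(HTMLParser):
    """Collects title/description pairs from <h2 class="job"> and <p class="desc">."""
    def __init__(self):
        super().__init__()
        self.listings, self._target = [], None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if (tag, cls) == ("h2", "job"):
            self.listings.append({"title": "", "description": ""})
            self._target = "title"
        elif (tag, cls) == ("p", "desc") and self.listings:
            self._target = "description"

    def handle_endtag(self, tag):
        self._target = None

    def handle_data(self, data):
        if self._target:
            self.listings[-1][self._target] += data


# Canned page standing in for what getURL would download.
PAGE = """<div id="jobs">
<h2 class="job">Programming Instructor</h2><p class="desc">Teach intro courses.</p>
<h2 class="job">Groundskeeper</h2><p class="desc">Maintain campus lawns.</p>
</div>"""


def query(url, keywords, html=PAGE):
    # Step 1: validate the page shape before trusting anything parsed from it.
    if 'id="jobs"' not in html:
        return ParseFailed("expected a jobs container at " + url)
    parser = ListingParser()
    parser.feed(html)
    # Steps 2 and 3: keep listings that match and wrap each one in a Match.
    results = []
    for item in parser.listings:
        if title_matches(keywords, item["title"]):
            match = Match(item["title"].strip(), item["description"].strip())
            match.addFields({"source": url})
            results.append(match)
    return results


found = query("http://example.com/jobs", ["programming"])
print([m.title for m in found])
```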
Note that plugins for specific sites are free to ignore the URL. This is particularly useful for sites that require you to request multiple pages (e.g., where the job listing results are divided into pages) or that require you to POST some data in order to get the actual results. The URL is just for convenience, in the case that multiple sites happen to use the same job listing back-end. (In this case, even if extra work is required to retrieve the results, you should still try to use the base part of the given URL, rather than hard-code your plugin to a particular site.) Note, however, that if you don't use getURL to download the page, you have to check for 404 or other HTTP errors yourself, and raise a NotFound exception if one occurs.
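When a plugin does its own downloading, the status check could look something like this. NotFound here is a local stand-in for the exception mentioned above, and ensure_ok is a hypothetical helper name; neither is taken from the jobhunt module.

```python
class NotFound(Exception):
    """Local stand-in for the NotFound exception a plugin is expected to raise."""


def ensure_ok(status_code, url):
    """Raise NotFound unless the HTTP status is 200 OK."""
    if status_code != 200:
        raise NotFound("%s returned HTTP %d" % (url, status_code))


# With Requests this would typically follow a call such as
#   response = requests.post(url, data=form_fields)
#   ensure_ok(response.status_code, url)
try:
    ensure_ok(404, "http://example.com/jobs")
except NotFound as err:
    print(err)  # http://example.com/jobs returned HTTP 404
```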
A note on Unicode: If you're going to use BeautifulSoup to parse the results of a request, you should pass the .content of the response, and not the .text, to BeautifulSoup. BeautifulSoup will do a better job of detecting the encoding, since it can look at the meta-encoding tag. You should also use BeautifulSoup's .get_text() method to extract snippets of the HTML, since it will return a unicode string with the correct encoding.
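A small standard-library demonstration of what can go wrong with .text: Requests documents a fallback to ISO-8859-1 for text responses whose headers carry no charset, so a UTF-8 page can end up decoded with the wrong codec. The sketch below reproduces that mismatch on a plain byte string; the sample markup is invented for the example.

```python
# Bytes as they might arrive in response.content for a UTF-8 page.
raw = '<meta charset="utf-8"><p>Café manager</p>'.encode("utf-8")

# What a wrong ISO-8859-1 guess produces (the failure mode of .text).
wrong = raw.decode("iso-8859-1")

# What a parser that honours the meta tag can recover from the raw bytes.
right = raw.decode("utf-8")

print("Café" in wrong)  # False
print("Café" in right)  # True
```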
A Note About Parsers
BeautifulSoup will use the "best" available parser, if none is specified. This means that the results of the various plugins may depend on which parsers you have installed. To the best of my knowledge, only one plugin (peopleadmin) strictly depends on a particular parser (lxml), and it requests it explicitly, so it will fail if you don't have it installed.
Still, if you write your own plugins, this is something to think about. You might want to test your plugin with the various parser options, to see if it will give the same results in all cases.