Module Scraper::Reader
In: lib/scraper/reader.rb

Methods

Classes and Modules

Class Scraper::Reader::HTMLParseError
Class Scraper::Reader::HTTPError
Class Scraper::Reader::HTTPInvalidURLError
Class Scraper::Reader::HTTPNoAccessError
Class Scraper::Reader::HTTPNotFoundError
Class Scraper::Reader::HTTPRedirectLimitError
Class Scraper::Reader::HTTPTimeoutError
Class Scraper::Reader::HTTPUnspecifiedError

Constants

REDIRECT_LIMIT = 3
DEFAULT_TIMEOUT = 30
PARSERS = [:tidy, :html_parser]
TIDY_OPTIONS = { :output_xhtml=>true, :show_errors=>0, :show_warnings=>false, :wrap=>0, :wrap_sections=>false, :force_output=>true, :quiet=>true, :tidy_mark=>false }
Page = Struct.new(:url, :content, :encoding, :last_modified, :etag)
Parsed = Struct.new(:document, :encoding)

Public Instance methods

Parses an HTML page and returns the encoding and HTML element. Raises HTMLParseError exceptions if it cannot parse the HTML.

Options are passed to the parser. For example, when using Tidy you can pass Tidy cleanup options in the hash.

The last option specifies which parser to use (see PARSERS). By default Tidy is used.
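As a rough illustration of the parser selection and option passing described above, the following sketch validates the parser name against PARSERS and combines caller options with the Tidy defaults. The helper name resolve_parser_options is hypothetical and not part of Scraper::Reader, TIDY_OPTIONS is abbreviated here, and the assumption that caller options take precedence over the defaults is not confirmed by this page.

```ruby
# Stand-in copies of the constants listed on this page so the sketch runs
# on its own. TIDY_OPTIONS is an abbreviated subset for illustration.
PARSERS = [:tidy, :html_parser]
TIDY_OPTIONS = { :output_xhtml => true, :wrap => 0, :quiet => true }

# Hypothetical helper (not part of Scraper::Reader): picks the parser,
# defaulting to Tidy, and works out the options to hand to it.
def resolve_parser_options(options = nil, parser = :tidy)
  unless PARSERS.include?(parser)
    raise ArgumentError, "Unknown parser #{parser.inspect}, expected one of #{PARSERS.inspect}"
  end
  options ||= {}
  # Assumption: caller-supplied options override the Tidy defaults.
  merged = parser == :tidy ? TIDY_OPTIONS.merge(options) : options
  [parser, merged]
end

parser, options = resolve_parser_options({ :wrap => 80 })
puts "#{parser}: #{options.inspect}"
```

Keeping the parser name separate from the options hash means the same hash can be forwarded to whichever parser is selected.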

Reads a Web page and returns its URL, content and cache control headers.

The request reads a Web page at the specified URL (must be a URI object). It accepts the following options:

  • :last_modified — Last modified header (from a previous request).
  • :etag — ETag header (from a previous request).
  • :redirect_limit — Number of redirects allowed (default is 3).
  • :user_agent — The User-Agent header to send.
  • :timeout — HTTP open connection/read timeouts (in seconds).
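The options above map naturally onto a conditional GET. The sketch below shows one way they could be translated into request headers and connection timeouts using Net::HTTP from the standard library. The helper names build_request and build_connection are hypothetical, and using DEFAULT_TIMEOUT as the fallback for :timeout is an assumption based on the constant listed on this page, not a confirmed detail of the implementation.

```ruby
require "net/http"
require "uri"

DEFAULT_TIMEOUT = 30 # stand-in for the constant listed on this page

# Hypothetical helper: turns the cache-related options into the headers
# of a conditional GET request.
def build_request(uri, options = {})
  request = Net::HTTP::Get.new(uri.request_uri)
  request["If-Modified-Since"] = options[:last_modified] if options[:last_modified]
  request["If-None-Match"] = options[:etag] if options[:etag]
  request["User-Agent"] = options[:user_agent] if options[:user_agent]
  request
end

# Hypothetical helper: applies :timeout to both the open and read timeouts.
# Creating the Net::HTTP object does not open a connection.
def build_connection(uri, options = {})
  http = Net::HTTP.new(uri.host, uri.port)
  timeout = options[:timeout] || DEFAULT_TIMEOUT # assumed default
  http.open_timeout = timeout
  http.read_timeout = timeout
  http
end

uri = URI.parse("http://example.com/index.html")
request = build_request(uri, :etag => '"abc123"', :user_agent => "MyScraper/1.0")
puts request["If-None-Match"]
```

A server that honors these headers answers 304 Not Modified when the page is unchanged, which is the case where no content is returned.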

It returns a hash with the following information:

  • :url — The URL of the page (may differ from the requested URL after a permanent redirect)
  • :content — The content of the response (may be nil if cached)
  • :content_type — The HTML page Content-Type header
  • :last_modified — Last modified cache control header (may be nil)
  • :etag — ETag cache control header (may be nil)
  • :encoding — Document encoding for the page

If the page has not been modified from the last request, the content is nil.

Raises HTTPError if an error prevents it from reading the page.
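The redirect limit, the nil content for an unmodified page, and the error cases can be sketched together as follows. This is a simplified illustration, not the module's actual code: the error classes are local stand-ins (that the specific errors subclass HTTPError is an assumption), the Response struct and the read_with_redirects helper are hypothetical, and the network call is replaced by a callable so the example runs offline.

```ruby
require "uri"

REDIRECT_LIMIT = 3 # stand-in for the constant listed on this page

# Local stand-ins for the error classes; the hierarchy is assumed.
class HTTPError < StandardError; end
class HTTPRedirectLimitError < HTTPError; end
class HTTPNotFoundError < HTTPError; end
class HTTPUnspecifiedError < HTTPError; end

# Hypothetical minimal response type used by the fake fetcher.
Response = Struct.new(:status, :headers, :body)

# Follows redirects up to redirect_limit. `fetch` is any callable that
# takes a URI and returns a Response.
def read_with_redirects(uri, fetch, redirect_limit = REDIRECT_LIMIT)
  page_url = uri
  loop do
    response = fetch.call(uri)
    case response.status
    when 200, 304
      # On 304 Not Modified the content stays nil.
      content = response.status == 200 ? response.body : nil
      return { :url => page_url, :content => content,
               :last_modified => response.headers["Last-Modified"],
               :etag => response.headers["ETag"] }
    when 301, 302, 303, 307
      raise HTTPRedirectLimitError, "Too many redirects" if redirect_limit <= 0
      redirect_limit -= 1
      uri += response.headers["Location"] # resolves relative locations
      page_url = uri if response.status == 301 # only permanent redirects change :url
    when 404
      raise HTTPNotFoundError, "Not found: #{uri}"
    else
      raise HTTPUnspecifiedError, "Unexpected status #{response.status}"
    end
  end
end

pages = {
  "http://example.com/old" => Response.new(301, { "Location" => "/new" }, nil),
  "http://example.com/new" => Response.new(200, { "ETag" => '"v1"' }, "<html></html>")
}
page = read_with_redirects(URI.parse("http://example.com/old"),
                           lambda { |u| pages.fetch(u.to_s) })
puts page[:url]
```

Because every specific error in the sketch descends from HTTPError, a caller can rescue HTTPError alone to handle any read failure.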

Protected Instance methods
