webinfo

v0.1.0
Published: Nov 28, 2025 License: Apache-2.0 Imports: 23 Imported by: 0

README

webinfo -- Extract metadata and structured information from web pages

webinfo is a small Go module that extracts common metadata from web pages and provides utilities to download representative images and create thumbnails.

Quick overview

  • Package: webinfo
  • Repository: github.com/goark/webinfo
  • Purpose: fetch page metadata (title, description, canonical, image, etc.) and download images

Features

  • Fetch page metadata with Fetch (handles encodings and meta tag precedence).
  • Download an image referenced by Webinfo.ImageURL using (*Webinfo).DownloadImage.
  • Create a thumbnail from the referenced image using (*Webinfo).DownloadThumbnail.

Install

Use Go modules (Go 1.25+ as used by the project):

go get github.com/goark/webinfo@latest

Basic usage

The example below fetches page metadata and then saves a thumbnail (errors are only printed, for brevity):

package main

import (
    "context"
    "fmt"

    "github.com/goark/webinfo"
)

func main() {
    ctx := context.Background()
    // Fetch metadata for a page (empty UA uses default)
    info, err := webinfo.Fetch(ctx, "https://text.baldanders.info/", "")
    if err != nil {
        fmt.Printf("error detail:\n%+v\n", err)
        return
    }

    // Download thumbnail: width 150, to directory "thumbnails", permanent file
    thumbPath, err := info.DownloadThumbnail(ctx, "thumbnails", 150, false)
    if err != nil {
        fmt.Printf("error detail:\n%+v\n", err)
        return
    }
    fmt.Println("thumbnail saved:", thumbPath)
}

API notes

  • Fetch(ctx, url, userAgent): fetch the page and extract its metadata. Pass an empty userAgent to use the module default.
  • (*Webinfo).DownloadImage(ctx, destDir, temporary): download the image in Webinfo.ImageURL and save it. If temporary is true (or destDir is empty), a temporary file is created.
  • (*Webinfo).DownloadThumbnail(ctx, destDir, width, temporary): download the referenced image and produce a thumbnail resized to width pixels; the height is computed to preserve the aspect ratio. If destDir is empty the method creates a temporary file; when temporary is false the thumbnail file is named after the original image, with -thumb inserted before the extension.

Note on defaults and test hooks:

  • Default width: If width <= 0 is passed to DownloadThumbnail, the method uses a default width of 150 pixels.
  • Extension detection: DownloadImage determines an output extension from the URL path, the response Content-Type (via mime.ExtensionsByType), or by sniffing up to the first 512 bytes with http.DetectContentType.
  • Test hooks / injection points: For easier testing the package exposes a few package-level variables that tests can override:
    • createFile: used to create temporary or permanent files (wraps os.CreateTemp / os.Create). Override to simulate file-creation failures.
    • decodeImage: wrapper around image.Decode used by DownloadThumbnail — override to simulate decode results (for example, to return a zero-dimension image).
    • outputImage: encoder that writes the thumbnail image to disk (wraps jpeg.Encode, png.Encode, etc.). Override to simulate encoder failures.

These hooks are intended for tests and let callers reproduce rare I/O or encoding failures without changing production behavior.

  • HTTP client timeout: DownloadImage uses an HTTP client with a default 30-second Timeout for the whole request; tests can override this by replacing the newHTTPClient package variable.

Test examples

Below are short examples showing how to override the package-level hooks from a test to simulate failures. These snippets are intended for *_test.go files and assume the usual testing and net/http/httptest helpers.

  1. Simulate thumbnail temporary-file creation failure (override createFile):
// in your test function
orig := createFile
defer func() { createFile = orig }()
createFile = func(temp bool, dir, pattern string) (*os.File, error) {
  // fail only for thumbnail temp pattern
  if temp && strings.Contains(pattern, "webinfo-thumb-") {
    return nil, errors.New("simulated thumbnail temp create failure")
  }
  return orig(temp, dir, pattern)
}

// then call the method under test
_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true)
if err == nil {
  t.Fatal("expected an error from DownloadThumbnail")
}

  2. Simulate a zero-dimension decoded image (override decodeImage):
origDecode := decodeImage
defer func() { decodeImage = origDecode }()
decodeImage = func(r io.Reader) (image.Image, string, error) {
  // return an image with zero width to hit the origW==0 error path
  return image.NewRGBA(image.Rect(0, 0, 0, 10)), "png", nil
}

_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true)
if err == nil {
  t.Fatal("expected an error from DownloadThumbnail")
}

  3. Simulate encoder failure when writing thumbnails (override outputImage):
origOut := outputImage
defer func() { outputImage = origOut }()
outputImage = func(dst *os.File, src *image.RGBA, format string) error {
  return errors.New("simulated encode failure")
}

_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true)
if err == nil {
  t.Fatal("expected an error from DownloadThumbnail")
}

  4. Simulate HTTP client timeout (override newHTTPClient):
origClient := newHTTPClient
defer func() { newHTTPClient = origClient }()
newHTTPClient = func() *http.Client {
  // short timeout for test
  return &http.Client{Timeout: 50 * time.Millisecond}
}

// then call DownloadImage, which uses newHTTPClient()
_, err := info.DownloadImage(ctx, t.TempDir(), true)
if err == nil {
  t.Fatal("expected a timeout error from DownloadImage")
}

Notes:

  • Ensure your test imports include errors, image, io, net/http, os, strings, and time as needed.
  • Restore the original variables with defer to avoid cross-test interference.
  • These examples are intentionally minimal; adapt them to your test fixtures (httptest servers, temp dirs, etc.).
  • The timeout example only fails if the server behind info.ImageURL takes longer than 50 ms to respond, for example an httptest handler that sleeps.

Error handling

The package uses github.com/goark/errs for wrapping errors with contextual keys (e.g. url, path, dir). Callers should inspect returned errors accordingly.

Tests & development

  • Run all tests: go test ./...
  • The repository includes Taskfile.yml tasks for common workflows; see that file for CI/test commands.

Documentation

Variables

var (
	ErrNullPointer = errors.New("null reference instance")
	ErrNoImageURL  = errors.New("no image URL")
	ErrInvalidURL  = errors.New("invalid URL")
)

Types

type Webinfo

type Webinfo struct {
	URL         string `json:"url,omitempty"`         // Original page URL
	Location    string `json:"location,omitempty"`    // Location
	Canonical   string `json:"canonical,omitempty"`   // Canonical URL
	Title       string `json:"title,omitempty"`       // Page title
	Description string `json:"description,omitempty"` // Meta description
	ImageURL    string `json:"image_url,omitempty"`   // Representative image URL
	UserAgent   string `json:"user_agent,omitempty"`  // User-Agent used to fetch the page
}

Webinfo holds common metadata extracted from a web page. It captures information useful for previews or link metadata:

  • URL: the original page URL.
  • Location: the final URL of the page after any redirects.
  • Canonical: the canonical URL declared by the page (if any).
  • Title: the page title.
  • Description: a short summary or meta description for the page.
  • ImageURL: a representative image URL suitable for previews.
  • UserAgent: the User-Agent string used to fetch the page.

All fields are strings; a field is left empty when the corresponding metadata is not present.
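To show how the omitempty tags affect serialized output, here is a small sketch. The pageInfo type is a local copy of the field names and JSON tags listed above (it is not the package's own type), and encode is a hypothetical helper introduced only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pageInfo copies the fields and JSON tags documented for webinfo.Webinfo.
type pageInfo struct {
	URL         string `json:"url,omitempty"`
	Location    string `json:"location,omitempty"`
	Canonical   string `json:"canonical,omitempty"`
	Title       string `json:"title,omitempty"`
	Description string `json:"description,omitempty"`
	ImageURL    string `json:"image_url,omitempty"`
	UserAgent   string `json:"user_agent,omitempty"`
}

// encode is a hypothetical helper that marshals p and returns "" on failure.
func encode(p pageInfo) string {
	b, err := json.Marshal(p)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	p := pageInfo{URL: "https://example.com/", Title: "Example"}
	// Only the two populated fields appear; empty strings are omitted.
	fmt.Println(encode(p)) // {"url":"https://example.com/","title":"Example"}
}
```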

func Fetch

func Fetch(ctx context.Context, urlStr, userAgent string) (info *Webinfo, err error)

Fetch retrieves metadata from the web page at urlStr and returns it as a *Webinfo.

Behavior:

  • Parses urlStr and performs an HTTP GET using the provided context (ctx).
  • If userAgent is empty, a default dummy User-Agent string is used.
  • Uses an HTTP client and sets the User-Agent request header.
  • Reads up to the first 1024 bytes of the response to detect the page character encoding via charset.DetermineEncoding (also considers the response Content-Type). If an encoding is detected or inferred by name, the response body is decoded accordingly before HTML parsing.

Parsing and extracted fields:

The document head is parsed with goquery, and the following fields are extracted:

  • Title: from <title>, then overridden by meta[property="twitter:title"] or meta[property="og:title"] if present.
  • Description: from meta[name="description"], then overridden by meta[property="twitter:description"] or meta[property="og:description"].
  • ImageURL: from meta[property="twitter:image"] or meta[property="og:image"].
  • Canonical: from link[rel="canonical"].

The returned Webinfo contains at least:

  • URL: the original urlStr (string form).
  • Location: the final request URL (after redirects) from the response.
  • UserAgent: the User-Agent actually used.

Error handling and resource cleanup:

  • Network, URL parsing, encoding detection, and HTML parsing errors are wrapped with contextual information (including the URL).
  • The response body is closed in a deferred function; any close error is joined with the returned error.
  • On error, Fetch returns a nil *Webinfo and a non-nil error.

Notes and guarantees:

  • The first 1024 bytes are peeked (without advancing the reader) to determine encoding.
  • DetermineEncoding's boolean return value is ignored (some encodings like Shift_JIS may be reported inconsistently); the detected encoding or a named encoding (via encoding.GetEncoding) is preferred.
  • The function honors context cancellation for the HTTP request.
  • Callers should assume that a non-nil *Webinfo is returned only on success; otherwise, info is nil.

func (*Webinfo) DownloadImage

func (w *Webinfo) DownloadImage(ctx context.Context, destDir string, temporary bool) (outPath string, err error)

DownloadImage downloads the image pointed to by w.ImageURL and saves it to destDir, returning the path of the saved file (outPath) or an error.

Behavior:

  • The method has a *Webinfo receiver and returns an error if w is nil or if ImageURL is empty.
  • ctx is used to control/cancel the underlying HTTP request.
  • destDir is cleaned with filepath.Clean. If it is non-empty, the directory (and any required parents) will be created with mode 0750. If destDir is empty, file creation uses the system/default behavior for temporary or current directories.
  • If `temporary` is true, the image is written to a temporary file (created via the package-level `createFile` helper which wraps `os.CreateTemp`) and the temporary file path is returned. If the URL path does not contain a filename, `temporary` is forced true.
  • If `temporary` is false, the image is written to `destDir` with the filename taken from the URL path. If the URL filename has no extension, an extension is appended (see extension resolution below). Existing files will be truncated by the underlying `createFile`/`os.Create` behavior.

HTTP download and content-type/extension resolution:

  • The image is fetched using an HTTP GET performed with the provided context; the request User-Agent is set via getUserAgent(w.UserAgent).
  • Extension resolution order:
    1. Extension from the URL path (if present).
    2. Extension(s) derived from the Content-Type response header via mime.ExtensionsByType.
    3. If still unknown, the first up-to-512 bytes of the body are read and http.DetectContentType is used to guess the content type, then mime.ExtensionsByType.
    4. If no extension can be determined, ".img" is used as a fallback.
  • When bytes are sniffed from the body, they are prepended back to the reader so the full image is written to disk. When multiple extensions are returned by mime.ExtensionsByType the implementation picks the last returned extension.
  • File creation is performed via the package-level `createFile` variable which tests may override to simulate create failures.

Resource management and errors:

  • The response body and any created file are closed using deferred cleanup; any close errors are joined into the returned error.
  • I/O, network and OS errors are returned (wrapped with contextual information).
  • On success, outPath contains the absolute/relative path to the saved image file; on error, outPath will be empty and err will describe the failure.

Notes:

  • The function may truncate an existing destination file with the same name.
  • The exact behavior of temporary file placement when destDir is empty follows the semantics of os.CreateTemp.

func (*Webinfo) DownloadThumbnail

func (w *Webinfo) DownloadThumbnail(ctx context.Context, destDir string, width int, temporary bool) (outPath string, err error)

DownloadThumbnail downloads the image referenced by the Webinfo receiver, scales it to the requested width (preserving aspect ratio), and writes the resulting thumbnail image to disk.

The method returns the path to the created thumbnail file or an error. Behavior details:

  • If the receiver is nil, ErrNullPointer is returned.
  • If width <= 0, a default width of 150 pixels is used.
  • destDir is cleaned and, if non-empty, created with mode 0750 (os.MkdirAll).
  • The original image is always downloaded to a temporary file via DownloadImage(..., true). That temporary original file is removed when the function returns (even on error).
  • The original image file is opened and decoded. If decoding fails, an error is returned.
  • The thumbnail height is computed to preserve aspect ratio: newH = round(width * origH / origW). newH is clamped to at least 1 pixel.
  • The image is resized using a Catmull-Rom resampler into an RGBA image of size width x newH.
  • The output format/extension is chosen from the decoded format: jpeg/jpg → .jpg, png → .png, gif → .gif. Unknown formats fall back to PNG.
  • If `temporary` is true, the thumbnail file is created via the package-level `createFile` helper (which wraps `os.CreateTemp`) in `destDir` using the pattern "webinfo-thumb-*<ext>"; the temporary file path is returned.
  • If `temporary` is false, the output filename is derived from the original image URL basename (falling back to "webinfo-image") and named "<base>-thumb<ext>" in `destDir`.
  • The encoder used to write the thumbnail is the package-level `outputImage` function variable; tests may replace this variable to simulate encoder failures. The image decoding step uses the package-level `decodeImage` wrapper around `image.Decode`, which tests may also override.
  • Files are properly closed with deferred cleanup; any close/remove errors are joined into the returned error using the errs package.
  • All filesystem, download, and image-processing errors are wrapped with contextual information (e.g., paths, URL) before being returned.
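The height computation from the list above can be worked through in a few lines. thumbHeight is a hypothetical helper written for this example; treating a non-positive original width as an error follows the origW==0 path mentioned in the test examples, and the exact error value is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// thumbHeight returns the height that preserves the origW:origH aspect
// ratio at the requested width, using the documented default of 150 pixels
// and clamping the result to at least 1 pixel.
func thumbHeight(width, origW, origH int) (int, error) {
	if origW <= 0 {
		return 0, errors.New("invalid image dimensions")
	}
	if width <= 0 {
		width = 150
	}
	h := int(math.Round(float64(width) * float64(origH) / float64(origW)))
	if h < 1 {
		h = 1
	}
	return h, nil
}

func main() {
	h, _ := thumbHeight(150, 400, 300)
	fmt.Println(h) // 113 (112.5 rounds away from zero)
	h, _ = thumbHeight(0, 300, 200)
	fmt.Println(h) // 100 (default width of 150 applied)
	h, _ = thumbHeight(10, 5000, 10)
	fmt.Println(h) // 1 (clamped)
}
```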

Parameters:

  • ctx: context for cancellation and timeouts passed to DownloadImage and other operations.
  • destDir: destination directory for the thumbnail (cleaned). If empty, creation uses the current directory semantics of os.Create/os.CreateTemp.
  • width: desired thumbnail width in pixels (defaults to 150 if <= 0).
  • temporary: if true, create a uniquely-named temporary file; otherwise create a stable filename based on the original image basename.
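For the stable filename used when temporary is false, a sketch of the "<base>-thumb<ext>" naming rule follows. thumbName is a hypothetical helper; how the package actually treats unusual URL paths (dot-files, trailing slashes) is not stated in this documentation, so those cases simply fall back to "webinfo-image" here.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// thumbName builds "<base>-thumb<ext>" from the image URL basename,
// falling back to "webinfo-image" when the URL path has no usable name.
func thumbName(imageURL, ext string) string {
	base := ""
	if u, err := url.Parse(imageURL); err == nil {
		base = path.Base(u.Path)
	}
	// Drop the original extension; the thumbnail gets its own.
	base = strings.TrimSuffix(base, path.Ext(base))
	if base == "" || base == "." || base == "/" {
		base = "webinfo-image"
	}
	return base + "-thumb" + ext
}

func main() {
	fmt.Println(thumbName("https://example.com/img/cover.png", ".jpg")) // cover-thumb.jpg
	fmt.Println(thumbName("https://example.com/", ".png"))              // webinfo-image-thumb.png
}
```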

Returns:

  • outPath: filesystem path to the created thumbnail file (valid when err == nil).
  • err: non-nil on failure; common failure reasons include a nil receiver (ErrNullPointer), a missing image URL (ErrNoImageURL), download errors, decode errors, filesystem errors, and invalid image dimensions.
