Plugin Data Filtering & Sampling Strategies

This document explains how Auspex plugin tools filter and sample articles for analysis, and how to extend the system with new filtering strategies.

Overview

When a user requests analysis (e.g., "Partisan Analysis", "Future Impact"), the system needs to fetch relevant articles. Rather than relying on generic vector search, which tends to return clusters of semantically similar articles, specialized plugins run targeted SQL queries that fetch articles with the specific data fields that analysis type needs.

Architecture

User Request ("Partisan Analysis for __all__")
                  │
                  ▼
┌─────────────────────────────────────┐
│  _should_use_tools()                │  ← Keyword detection
│  (auspex_service.py)                │
│  Checks for: "bias", "partisan",    │
│  "future", "impact", etc.           │
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  _check_plugin_tools()              │  ← Topic conversion
│  (auspex_service.py)                │
│  Converts __all__ → None for        │
│  cross-topic mode                   │
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  tool_registry.find_matching_tools()│  ← Plugin matching
│  (tool_plugin_base.py)              │
│  Matches triggers in tool.md files  │
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  tool_registry.execute_tool()       │  ← Handler dispatch
│  Gets PromptOnlyToolHandler for     │
│  the matched tool                   │
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  PromptOnlyToolHandler.execute()    │  ← Action execution
│  (tool_plugin_base.py)              │
│  Runs actions defined in tool.md    │
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  DatabaseQueryFacade methods        │  ← Data filtering
│  get_articles_with_bias_data()      │
│  get_articles_with_future_signals() │
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  Distribution calculation           │  ← Statistics
│  _get_bias_distribution_from_field()│
│  _get_future_signal_distribution()  │
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  LLM Prompt with {placeholders}     │  ← Analysis
│  {articles}, {bias},                │
│  {future_signals}                   │
└─────────────────────────────────────┘

Current Filtering Strategies

1. Bias Analysis (bias_analysis action)

Purpose: Fetch articles that have political bias data for partisan/media bias analysis.

Database Fields Used:

  • bias - Political leaning (e.g., "right-center", "left", "least biased")

  • bias_source - Source of the bias rating

  • news_source - For grouping by source

Facade Method: get_articles_with_bias_data(topic_name, limit, days_back)

Normalization: Raw bias values are normalized to standard categories:

  • "right-center" → "Center-Right"

  • "left-center" → "Center-Left"

  • "least biased" → "Center"

  • "far right", "extreme right" → "Far-Right"

  • etc.

Distribution Method: _get_bias_distribution_from_field(articles)

Returns:
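
A minimal sketch of the normalization and counting steps, with one plausible return shape. The function names, the mapping table beyond the entries listed above, the title-case fallback for unmapped values, and the keys of the returned dict are all assumptions for illustration, not the actual Auspex implementation:

```python
from collections import Counter

# Only these entries come from the normalization list above; the real
# table is longer ("etc.") and its remaining rows are not shown here.
BIAS_CATEGORIES = {
    "right-center": "Center-Right",
    "left-center": "Center-Left",
    "least biased": "Center",
    "far right": "Far-Right",
    "extreme right": "Far-Right",
}


def normalize_bias(raw):
    """Map a raw bias string to a standard category.

    Assumption: unmapped values are passed through in title case.
    """
    key = raw.strip().lower()
    return BIAS_CATEGORIES.get(key, key.title())


def bias_distribution(articles):
    """Count normalized bias categories, skipping articles with no bias value."""
    counts = Counter(normalize_bias(a["bias"]) for a in articles if a.get("bias"))
    total = sum(counts.values())
    percentages = {k: round(100 * v / total, 1) for k, v in counts.items()}
    # Hypothetical return shape: totals, raw counts, and percentages.
    return {"total": total, "counts": dict(counts), "percentages": percentages}
```

Normalizing before counting means that "right-center" and "RIGHT-CENTER " land in the same bucket, which keeps the distribution handed to the LLM from being split across spelling variants.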

2. Future Impact Analysis (future_impact_analysis action)

Purpose: Fetch articles with forward-looking predictions and impact timelines.

Database Fields Used:

  • future_signal - Prediction type (e.g., "AI will accelerate", "Escalation into armed conflict")

  • time_to_impact - Timeline (e.g., "Immediate (0-6 months)", "Mid-term (18-60 months)")

  • sentiment - Article sentiment (stronger sentiments often correlate with stronger predictions)

Facade Method: get_articles_with_future_signals(topic_name, limit, days_back)

Distribution Method: _get_future_signal_distribution(articles)

Returns:
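
As a rough illustration, a distribution over the two future-oriented fields could be computed as below. The function name, the "Unknown" bucket for a missing timeline, and the returned keys are assumptions, not the actual return value of _get_future_signal_distribution():

```python
from collections import Counter


def future_signal_distribution(articles):
    """Tally future_signal and time_to_impact values across articles.

    Articles without a future_signal are skipped; a missing
    time_to_impact is counted under "Unknown" (an assumption).
    """
    signals = Counter()
    timelines = Counter()
    for article in articles:
        signal = article.get("future_signal")
        if not signal:
            continue
        signals[signal] += 1
        timelines[article.get("time_to_impact") or "Unknown"] += 1
    # Hypothetical return shape: one count table per field.
    return {
        "total": sum(signals.values()),
        "signals": dict(signals),
        "time_to_impact": dict(timelines),
    }
```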

3. Sentiment Analysis (sentiment_analysis action)

Purpose: Analyze sentiment distribution across articles.

Database Fields Used:

  • sentiment - Article sentiment (Positive, Negative, Neutral, Critical, Hyperbolic)

Distribution Method: _get_sentiment_distribution(db, topic, articles)

4. Vector Search (default)

Purpose: Semantic similarity search for general queries.

When Used: When no specialized action is defined, or for the vector_search action.

Limitation: Results cluster around semantically similar content, so they may not provide a diverse sample.

Cross-Topic Mode

When topic is __all__ or None, the system operates in cross-topic mode; _check_plugin_tools() converts __all__ to None before the tool runs. In this mode:

  • No topic filter is applied to SQL queries

  • Articles from all topics are included

  • Distribution calculations span all topics
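
The cross-topic behavior can be sketched as two small helpers: one that maps the __all__ sentinel to None, and one that only emits a topic clause when a real topic is present. Both function names and the SQL fragment are hypothetical; the actual conversion lives in _check_plugin_tools() and the real queries in DatabaseQueryFacade:

```python
def normalize_topic(topic):
    """Treat the "__all__" sentinel the same as None (cross-topic mode)."""
    return None if topic in (None, "__all__") else topic


def topic_filter(topic):
    """Return a (sql_fragment, params) pair for the topic filter.

    In cross-topic mode the fragment is empty, so the query spans all topics.
    """
    topic = normalize_topic(topic)
    if topic is None:
        return "", {}
    return " AND topic = :topic", {"topic": topic}
```

Keeping the topic as a bound parameter rather than formatting it into the SQL string avoids injection problems when topic names come from user input.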

Adding a New Filtering Strategy

Step 1: Add Database Facade Method

In app/database_query_facade.py:
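
A minimal sketch of what such a method could look like. It uses the standard-library sqlite3 module as a stand-in; the real DatabaseQueryFacade, its connection handling, and its SQL dialect may differ. The class name DemoQueryFacade, the method get_articles_with_category(), and the today parameter (added only to make the example reproducible) are hypothetical:

```python
import sqlite3
from datetime import date, timedelta


class DemoQueryFacade:
    """Illustrative stand-in for DatabaseQueryFacade, backed by sqlite3."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.row_factory = sqlite3.Row  # rows become dict-convertible

    def get_articles_with_category(self, topic_name=None, limit=100,
                                   days_back=30, today=None):
        """Fetch recent articles whose category field is populated."""
        cutoff = ((today or date.today()) - timedelta(days=days_back)).isoformat()
        sql = (
            "SELECT * FROM articles "
            "WHERE category IS NOT NULL AND category != '' "
            "AND publication_date >= :cutoff"
        )
        params = {"cutoff": cutoff, "limit": limit}
        if topic_name is not None:  # None means cross-topic mode
            sql += " AND topic = :topic"
            params["topic"] = topic_name
        sql += " ORDER BY publication_date DESC LIMIT :limit"
        return [dict(row) for row in self.conn.execute(sql, params)]
```

The signature mirrors the (topic_name, limit, days_back) shape of the existing facade methods so that handlers can call every strategy the same way.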

Step 2: Add Action Handler

In app/services/tool_plugin_base.py, add to the execute() method:
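
The sketch below shows the general shape of an action branch, not the real execute() body. DemoToolHandler, the category_analysis action name, the params keys, and the default values are hypothetical, and the real method may well be asynchronous and take different arguments:

```python
class DemoToolHandler:
    """Hypothetical stand-in for PromptOnlyToolHandler; not the real class."""

    def __init__(self, facade):
        self.facade = facade

    def execute(self, action, params):
        # "__all__" and None both mean cross-topic mode.
        topic = params.get("topic")
        if topic == "__all__":
            topic = None
        limit = params.get("limit", 100)        # assumed default
        days_back = params.get("days_back", 30)  # assumed default

        if action == "category_analysis":  # hypothetical new action
            articles = self.facade.get_articles_with_category(
                topic, limit, days_back
            )
            return {
                "action": action,
                "articles": articles,
                "article_count": len(articles),
            }

        raise ValueError(f"Unsupported action: {action!r}")
```

The handler only delegates to the facade and packages the result, which keeps raw SQL out of the handler layer.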

Step 3: Add Distribution Method

In app/services/tool_plugin_base.py:
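
One way a distribution method might look, assuming articles arrive as a list of dicts. The function name, the category field choice, and the summary text format intended for a prompt placeholder are all assumptions:

```python
from collections import Counter


def get_category_distribution(articles):
    """Count category values and build a short text summary for the prompt."""
    counts = Counter(a["category"] for a in articles if a.get("category"))
    total = sum(counts.values())
    # most_common() orders by descending count; the loop is empty when
    # there are no articles, so total is never used as a zero divisor.
    lines = [
        f"- {name}: {n} ({n / total:.0%})" for name, n in counts.most_common()
    ]
    return {"total": total, "counts": dict(counts), "summary": "\n".join(lines)}
```

Returning a preformatted summary alongside the raw counts lets the prompt template drop the text straight into a placeholder while other code can still use the numbers.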

Step 4: Create Plugin Definition

Create data/auspex/plugins/custom_analysis/tool.md:

Step 5: Add Trigger Keywords

In app/services/auspex_service.py, add to _should_use_tools():
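
Keyword detection of this kind can be sketched as a case-insensitive substring check. The first four keywords come from the architecture diagram above; "category breakdown" is a hypothetical addition, and the real _should_use_tools() may use a different signature or matching rule:

```python
# "bias", "partisan", "future", "impact" appear in the architecture
# diagram; "category breakdown" is a made-up trigger for a new strategy.
TOOL_KEYWORDS = ("bias", "partisan", "future", "impact", "category breakdown")


def should_use_tools(message):
    """Return True if the message contains any tool trigger keyword."""
    text = message.lower()
    return any(keyword in text for keyword in TOOL_KEYWORDS)
```

Substring matching is deliberately loose: "impact" also matches "impactful". If that produces false positives, matching on word boundaries is a stricter alternative.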

Available Article Fields

The articles table contains these fields that can be used for filtering:

| Field | Type | Description | Example Values |
|-------|------|-------------|----------------|
| bias | text | Political bias rating | "right-center", "left", "least biased" |
| bias_source | text | Source of bias rating | "mediabiasfactcheck.com" |
| sentiment | text | Article sentiment | "Positive", "Negative", "Neutral", "Critical" |
| future_signal | text | Future prediction type | "AI will accelerate", "Escalation into armed conflict" |
| time_to_impact | text | Impact timeline | "Immediate (0-6 months)", "Mid-term (18-60 months)" |
| category | text | Article category | "Technology", "Politics" |
| topic | text | Topic name | "AI", "Geopolitics" |
| news_source | text | Source domain | "reuters.com", "bbc.com" |
| publication_date | date | Publication date | "2025-01-15" |

Query current field distributions:
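
A small helper along these lines can report how values of one field are distributed. It is written against sqlite3 for illustration only; the production database and driver may differ. The function name and the whitelist of fields are assumptions, though the field names themselves come from the table above:

```python
import sqlite3

# Column names cannot be bound as SQL parameters, so the field is
# checked against a whitelist before being formatted into the query.
FILTERABLE_FIELDS = {
    "bias", "bias_source", "sentiment", "future_signal",
    "time_to_impact", "category", "topic", "news_source",
}


def field_distribution(conn, field):
    """Return (value, count) rows for one articles column, most common first."""
    if field not in FILTERABLE_FIELDS:
        raise ValueError(f"Unknown field: {field!r}")
    sql = (
        f"SELECT {field}, COUNT(*) AS n FROM articles "
        f"WHERE {field} IS NOT NULL "
        f"GROUP BY {field} ORDER BY n DESC, {field}"
    )
    return conn.execute(sql).fetchall()
```

Running this before adding a strategy shows whether a field is populated often enough to be worth filtering on.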

Best Practices

  1. Always filter for non-null values - Many articles don't have all fields populated

  2. Use the facade pattern - Don't write raw SQL in handlers; add methods to DatabaseQueryFacade

  3. Provide fallbacks - If a specialized query returns no results, fall back gracefully

  4. Log fetched counts - Helps debug when results seem wrong

  5. Normalize values - Raw data often has inconsistent formatting

  6. Include distribution in prompt - Give the LLM statistical context alongside articles

  7. Test with cross-topic mode - Ensure topic=None works correctly
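
Practices 3 and 4 can be combined in one small wrapper, sketched below. The function name, the logger name, and the idea of passing the primary and fallback fetchers as callables are assumptions made for illustration:

```python
import logging

logger = logging.getLogger("auspex.demo")  # hypothetical logger name


def fetch_with_fallback(primary, fallback, label):
    """Run the specialized fetch; fall back if it returns nothing.

    primary and fallback are zero-argument callables returning a list
    of articles. Counts are logged at each step to aid debugging.
    """
    articles = primary()
    logger.info("%s: specialized query fetched %d articles", label, len(articles))
    if not articles:
        logger.warning("%s: no results, using fallback", label)
        articles = fallback()
        logger.info("%s: fallback fetched %d articles", label, len(articles))
    return articles
```

A caller might pass a facade call as primary and a broader query or a vector search as fallback, for example with functools.partial or a lambda.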

Debugging

Check logs for:

If you see low counts or errors, check:

  1. Database has articles with the required fields populated

  2. Date range isn't too restrictive

  3. Topic filter isn't excluding all results

Last updated