
Getting Started with MetaGPT Agent Development 5: A General-Purpose Subscription Agent

Tasks for this lesson
- Use an LLM to extract the information we need instead of writing crawler code.
- Rewrite the subscription agent with ActionNode so that website content can be crawled and parsed from a natural-language requirement. In the tutorial, the subscription agent is launched by RunSubscription: this single action both generates the subscription-agent code and starts the SubscriptionRunner, so RunSubscription never exits. Try to separate the two, i.e. split an AddSubscriptionTask action out of RunSubscription and let the SubscriptionRunner run on its own (in the same process or in a different one).

In lesson 3, "Getting Started with MetaGPT Agent Development 3: Subscription Agent (OSS) in Practice", we hand-built an OSS (Open Source Software) subscription agent that crawls the GitHub Trending and Hugging Face Papers pages. Whenever we want to subscribe to another data source, we have to write yet another Role by hand, which makes the subscription agent costly to extend. Is there any way to make the Role generic?

Ideas for making the Role generic

Idea 1
Build an agent that can crawl any website we ask for, and use an LLM to extract the required information instead of writing crawler code. Two things are needed:

  1. Crawl the site's HTML
    The implementation in lesson 3, "Getting Started with MetaGPT Agent Development 3: Subscription Agent (OSS) in Practice", has a drawback: crawling dynamic pages, executing JavaScript, or getting around simple anti-crawling measures all take extra effort. The solution is browser automation; Python has plenty of browser-automation tools, such as Selenium and Playwright, so essentially any page a person can browse can be crawled this way (a minimal Playwright sketch follows this list).
  2. Extract the information with the LLM: send the crawled HTML to the LLM and let it pull out the required information.
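
As a concrete illustration of the browser-automation option, here is a minimal sketch of fetching a rendered page with Playwright. This is an assumption-laden sketch only (it assumes Playwright and its Chromium browser are installed); the code in this post keeps using aiohttp and, later, MetaGPT's own WebBrowserEngine.

# Minimal Playwright sketch (assumes `pip install playwright` and `playwright install chromium`
# have been run). Not used by the code below; it only illustrates the browser-automation option.
import asyncio
from playwright.async_api import async_playwright


async def fetch_html(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")  # wait until JS-rendered content settles
        html = await page.content()
        await browser.close()
        return html


if __name__ == "__main__":
    print(asyncio.run(fetch_html("https://huggingface.co/papers"))[:500])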

After crawling the HTML we still have to extract the data from it (for example, today's GitHub Trending list). Handing the whole page straight to a large model burns far too many tokens and is expensive; how to bring this cost down is the subject of Idea 2 below.
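
To get a feel for the cost before actually calling the model, you can count tokens locally. A minimal sketch, assuming tiktoken's cl100k_base encoding is a reasonable proxy for the tokenizer of the model actually used:

# Rough local token count for a page (assumption: cl100k_base approximates the real tokenizer).
import tiktoken


def count_tokens(text: str) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))


# On the Hugging Face Papers page this lands in the tens of thousands of tokens,
# consistent with the ~31k prompt tokens reported in the cost log further below.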

Extracting the required information from HTML with an LLM

Let's start with part 2 of Idea 1: send the HTML to the LLM and have it extract the information we need, using the Hugging Face Papers page as the example. The pipeline changes from aiohttp -> bs4 to aiohttp -> LLM.
For instance, the original implementation that collects the list of paper URLs looks like this:

def hg_article_urls(html_soup):
    _urls = []
    for article in html_soup.select('article.flex.flex-col.overflow-hidden.rounded-xl.border'):
        url = article.select_one('h3 a')['href']
        _urls.append('https://huggingface.co' + url)
    return _urls

Rewritten so that the large model extracts the information from the HTML:

PROMPT_TEMPLATE = """Please extract a portion of content from HTML text to achieve the User Requirement with \
the HTML content provided in the Context.

## User Requirement
{requirement}

## Context
The html page content to be extracted is show like below:

```tree
{html}
```
"""

URLS_REQUIREMENT = "Extracting a list of URLs for Daily Papers from HTML text,"\
               "Just give me the URLs, exactly as `https://huggingface.co/papers/xxx,https://huggingface.co/papers/xxx, ...`, "\
               "don't give me python code, don't output any unnecessary characters"

def extract_urls(text):
    pattern = re.compile(r'https:\/\/huggingface\.co\/papers\/\d+\.\d+')
    return pattern.findall(text)


async def hg_article_urls(html):
    global llm
    prompt = PROMPT_TEMPLATE.format(html=html,
                                       requirement=URLS_REQUIREMENT)
    resp = await llm.aask(prompt)
    _urls = list(extract_urls(resp))
    print(', '.join(_urls))
    return _urls

The output looks like this:

The list of URLs for Daily Papers from the HTML text are:

- <https://huggingface.co/papers/2401.13627>
- <https://huggingface.co/papers/2401.13601>
- <https://huggingface.co/papers/2401.13660>
- <https://huggingface.co/papers/2401.13160>
- <https://huggingface.co/papers/2401.13388>
- <https://huggingface.co/papers/2401.13303>
- <https://huggingface.co/papers/2401.13311>
2024-01-25 23:24:06.749 | INFO     | metagpt.utils.cost_manager:update_cost:48 - Total running cost: .002 | Max budget: .000 | Current cost: .002, prompt_tokens: 31247, completion_tokens: 134
https://huggingface.co/papers/2401.13627, https://huggingface.co/papers/2401.13601, https://huggingface.co/papers/2401.13660, https://huggingface.co/papers/2401.13160, https://huggingface.co/papers/2401.13388, https://huggingface.co/papers/2401.13303, https://huggingface.co/papers/2401.13311

Likewise, for each paper URL we let the large model extract the article information directly:

ARTICLE_REQUIREMENT="Extracting a list of article infomation for Paper from HTML text," \
               "Just give me the article infomation in the json format, exactly as \
                `{'id':id, 'title':title, 'upvotes':upvotes, 'publishedAt':publishedAt, 'summary':summary}` " \
               "don't give me python code, don't output any unnecessary characters"


def extract_article(text):
    return json.loads(text.replace('```json', '').replace('```', ''))


async def hg_article_infos(_url, html):
    global llm
    logger.info(f'Parsing {_url}')
    prompt = PROMPT_TEMPLATE.format(html=html,
                                    requirement=ARTICLE_REQUIREMENT)
    resp = await llm.aask(prompt)
    _article = extract_article(resp)
    _article['url'] = _url
    return _article


async def get_hg_articles():
    _, _html = await get_html("https://huggingface.co/papers")
    hg_urls = await hg_article_urls(_html)
    _htmls = await asyncio.gather(*[get_html(url) for url in hg_urls])
    hg_articles = await asyncio.gather(*map(lambda param: hg_article_infos(param[0], param[1]), _htmls))

    return list(hg_articles)

Running the whole pipeline prints one dict per paper:

{'id': '2401.13627', 'title': 'Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild', 'upvotes': 19, 'publishedAt': '2024-01-24T17:58:07.000Z', 'summary': "We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image restoration method that harnesses generative prior and the power of model scaling up. Leveraging multi-modal techniques and advanced generative prior, SUPIR marks a significant advance in intelligent and realistic image restoration. As a pivotal catalyst within SUPIR, model scaling dramatically enhances its capabilities and demonstrates new potential for image restoration. We collect a dataset comprising 20 million high-resolution, high-quality images for model training, each enriched with descriptive text annotations. SUPIR provides the capability to restore images guided by textual prompts, broadening its application scope and potential. Moreover, we introduce negative-quality prompts to further improve perceptual quality. We also develop a restoration-guided sampling method to suppress the fidelity issue encountered in generative-based restoration. Experiments demonstrate SUPIR's exceptional restoration effects and its novel capacity to manipulate restoration through textual prompts.", 'url': 'https://huggingface.co/papers/2401.13627'}
{'id': '2401.13601', 'title': 'MM-LLMs: Recent Advances in MultiModal Large Language Models', 'upvotes': 18, 'publishedAt': '2024-01-24T17:10:45.000Z', 'summary': 'In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Specifically, we first outline general design formulations for model architecture and training pipeline. Subsequently, we provide brief introductions of 26 existing MM-LLMs, each characterized by its specific formulations. Additionally, we review the performance of MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Lastly, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.', 'url': 'https://huggingface.co/papers/2401.13601'}
{'id': '2401.13660', 'title': 'MambaByte: Token-free Selective State Space Model', 'upvotes': 17, 'publishedAt': '2024-01-24T18:53:53.000Z', 'summary': 'Token-free language models learn directly from raw bytes and remove the bias\nof subword tokenization. Operating on bytes, however, results in significantly\nlonger sequences, and standard autoregressive Transformers scale poorly in such\nsettings. We experiment with MambaByte, a token-free adaptation of the Mamba\nstate space model, trained autoregressively on byte sequences. Our experiments\nindicate the computational efficiency of MambaByte compared to other byte-level\nmodels. We also find MambaByte to be competitive with and even outperform\nstate-of-the-art subword Transformers. Furthermore, owing to linear scaling in\nlength, MambaByte benefits from fast inference compared to Transformers. Our\nfindings establish the viability of MambaByte in enabling token-free language\nmodeling.', 'url': 'https://huggingface.co/papers/2401.13660'}
{'id': '2401.13388', 'title': 'UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion', 'authors': ['Wei Li', 'Xue Xu', 'Jiachen Liu', 'Xinyan Xiao'], 'publishedAt': '2024-01-24T11:36:44.000Z', 'summary': 'Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, which demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: firstly pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct multi-modal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.', 'upvotes': 7, 'url': 'https://huggingface.co/papers/2401.13388'}
{'id': '2401.13160', 'title': 'SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection', 'upvotes': 6, 'publishedAt': '2024-01-24T00:36:13.000Z', 'summary': 'Pre-training large language models is known to be extremely resource intensive and often times inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and token replacement detection (RTD), and (2) a two-stage curriculum that optimizes the hybrid objective over the initial tau iterations, then transitions to standard SC loss. We show empirically that the effectiveness of the hybrid objective is tied to the two-stage pre-training schedule, and provide extensive analysis on why this is the case. In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training, while enabling a 50% reduction in pre-training iterations and 40% reduction in total FLOPs. Alternatively, given the same amount of computing budget, we find that SpacTor results in significantly improved downstream benchmark performance.', 'url': 'https://huggingface.co/papers/2401.13160'}
{'id': '2401.13303', 'title': 'MaLA-500: Massive Language Adaptation of Large Language Models', 'upvotes': 4, 'publishedAt': '2024-01-24T08:57:39.000Z', 'summary': 'Large language models have advanced the state of the art in natural language\nprocessing. However, their predominant design for English or a limited set of\nlanguages creates a substantial gap in their effectiveness for low-resource\nlanguages. To bridge this gap, we introduce MaLA-500, a novel large language\nmodel designed to cover an extensive range of 534 languages. To train MaLA-500,\nwe employ vocabulary extension and continued pretraining on LLaMA 2 with\nGlot500-c. Our experiments on SIB-200 show that MaLA-500 achieves\nstate-of-the-art in-context learning results. We release MaLA-500 at\nhttps://huggingface.co/MaLA-LM', 'url': 'https://huggingface.co/papers/2401.13303'}
{'id': '2401.13311', 'title': 'ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models', 'upvotes': 4, 'publishedAt': '2024-01-24T09:07:11.000Z', 'summary': "Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide qualitative analysis which provides a robust framework for future advancements in the LMM design. https://con-textual.github.io/", 'url': 'https://huggingface.co/papers/2401.13311'}

The complete code is as follows:

import asyncio
import json
import re

import aiohttp
from bs4 import BeautifulSoup

from metagpt.config import CONFIG
from metagpt.llm import LLM
from metagpt.logs import logger


def get_local_html_soup(url, features='html.parser'):
    with open(url, encoding="utf-8") as f:
        html = f.read()
    soup = BeautifulSoup(html, features)
    return soup


async def get_html(url: str):
    async with aiohttp.ClientSession() as client:
        async with client.get(url, proxy=CONFIG.global_proxy) as response:
            response.raise_for_status()
            html = await response.text()

    return url, html


PROMPT_TEMPLATE = """Please extract a portion of content from HTML text to achieve the User Requirement with \
the HTML content provided in the Context.

## User Requirement
{requirement}

## Context
The html page content to be extracted is show like below:

```tree
{html}
```
"""

URLS_REQUIREMENT = "Extracting a list of URLs for Daily Papers from HTML text,"\
               "Just give me the URLs, exactly as `https://huggingface.co/papers/xxx,https://huggingface.co/papers/xxx, ...`, "\
               "don't give me python code, don't output any unnecessary characters"

def extract_urls(text):
    pattern = re.compile(r'https:\/\/huggingface\.co\/papers\/\d+\.\d+')
    return pattern.findall(text)


async def hg_article_urls(html):
    global llm
    prompt = PROMPT_TEMPLATE.format(html=html,
                                       requirement=URLS_REQUIREMENT)
    resp = await llm.aask(prompt)
    _urls = list(extract_urls(resp))
    print(', '.join(_urls))
    return _urls


ARTICLE_REQUIREMENT="Extracting a list of article infomation for Paper from HTML text," \
               "Just give me the article infomation in the json format, exactly as \
                `{'id':id, 'title':title, 'upvotes':upvotes, 'publishedAt':publishedAt, 'summary':summary}` " \
               "don't give me python code, don't output any unnecessary characters"


def extract_article(text):
    return json.loads(text.replace('```json','').replace('```',''))


async def hg_article_infos(_url, html):
    global llm
    logger.info(f'Parsing {_url}')
    prompt = PROMPT_TEMPLATE.format(html=html,
                                       requirement=ARTICLE_REQUIREMENT)
    resp = await llm.aask(prompt)
    _article = extract_article(resp)
    _article['url'] = _url
    return _article


async def get_hg_articles():
    _, _html = await get_html("https://huggingface.co/papers")
    hg_urls = await hg_article_urls(_html)
    _htmls = await asyncio.gather(*[get_html(url) for url in hg_urls])
    hg_articles = await asyncio.gather(*map(lambda param: hg_article_infos(param[0], param[1]), _htmls))

    return list(hg_articles)


if __name__ == "__main__":
    import asyncio

    llm = LLM()
    for article in asyncio.run(get_hg_articles()):
        print(article)

In the end this approach does get us the Hugging Face paper information, but it has some clear drawbacks:

  • It is extremely token-hungry: the entire HTML is thrown at the large model on every call, and testing burned through well over a million tokens.
  • It is not effortless at all; it can take even more time than simply writing the crawler code by hand.
  • It is less precise than hand-written crawler code.

So let's try Idea 2 as soon as possible and bring ActionNode on stage!

Idea 2

  1. Crawl the site's HTML: crawl the same way as in Idea 1, i.e. with browser-automation tools such as Selenium or Playwright.
  2. Slim down the HTML: keep mainly the class-attribute information, converted into CSS expressions together with the corresponding content, and hand only that to the LLM, which greatly reduces token consumption (a minimal sketch follows this list).
  3. Structure the requirement: use ActionNode to turn the user's request into structured output.
  4. Let the LLM write the crawler code: send the slimmed HTML and the structured requirement to the LLM and have it write the parse code automatically.
  5. Separate the SubscriptionRunner from RunSubscription.
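
To make step 2 concrete, here is a minimal sketch of the slimming idea, assuming we only keep tag names, class attributes and visible text; the complete code below does this more carefully with its get_outline helper.

# Minimal sketch of "HTML slimming" (illustrative only; the real implementation
# is the get_outline helper in the complete code below).
from bs4 import BeautifulSoup


def slim_html(html: str) -> str:
    """Keep only tag names, class attributes and text, one line per element."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for element in soup.find_all(True):  # iterate over every tag
        if element.name in ("script", "style"):  # never useful for parsing
            continue
        selector = ".".join([element.name, *element.get("class", [])])
        text = (element.string or "").strip()
        lines.append(f"{selector}: {text}")
    return "\n".join(lines)


# Each element becomes one "tag.class: text" line, e.g. "a.title: Some startup raised funding",
# which is far smaller than the raw HTML that was sent to the LLM in Idea 1.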

As for step 5: in the tutorial, RunSubscription not only creates the subscription-agent code but also starts the SubscriptionRunner, which means RunSubscription can never exit. Here we separate the two: RunSubscription only builds the role, the trigger and the runner and returns the runner, while a new AddSubscriptionTask action actually runs it. The main change is therefore to rewrite the SubscriptionAssistant Role and add the AddSubscriptionTask Action. The modified parts look like this:

class RunSubscription(Action):
    async def run(self, msgs):
        from metagpt.roles.role import Role

        code = msgs[-1].content
        req = msgs[-2].instruct_content.model_dump()
        urls = req["Crawler URL List"]
        process = req["Crawl Post Processing"]
        spec = req["Cron Expression"]
        SubAction = self.create_sub_action_cls(urls, code, process)
        SubRole = type("SubRole", (Role,), {})
        role = SubRole()
        role.init_actions([SubAction])
        runner = SubscriptionRunner()

        async def callback(msg):
            print(msg)

        trigger = CronTrigger(spec)
        await runner.subscribe(role, trigger, callback)
        return runner  # return the runner instead of running it here

    @staticmethod
    def create_sub_action_cls(urls: list[str], code: str, process: str):
        modules = {}
        for url in urls[::-1]:
            code, current = code.rsplit(f"# {url}", maxsplit=1)
            name = uuid4().hex
            module = type(sys)(name)
            exec(current, module.__dict__)
            modules[url] = module

        class SubAction(Action):
            async def run(self, *args, **kwargs):
                pages = await WebBrowserEngine().run(*urls)
                if len(urls) == 1:
                    pages = [pages]
                data = []
                for url, page in zip(urls, pages):
                    data.append(getattr(modules[url], "parse")(page.soup))
                return await self.llm.aask(SUB_ACTION_TEMPLATE.format(process=process, data=data))

        return SubAction


# New action: actually run the SubscriptionRunner built by RunSubscription
class AddSubscriptionTask(Action):
    async def run(self, runner: SubscriptionRunner):
        logger.info("I will run SubscriptionRunner")
        await runner.run()


# Subscription assistant role
class SubscriptionAssistant(Role):
    """Analyze user subscription requirements."""

    name: str = "Grace"
    profile: str = "Subscription Assistant"
    goal: str = "analyze user subscription requirements to provide personalized subscription services."
    constraints: str = "utilize the same language as the User Requirement"

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)
        self._init_actions([ParseSubRequirement, RunSubscription, AddSubscriptionTask])
        self._watch([UserRequirement, WriteCrawlerCode])

    async def _think(self) -> bool:
        cause_by = self.rc.history[-1].cause_by
        if cause_by == any_to_str(UserRequirement):
            state = 0
        elif cause_by == any_to_str(WriteCrawlerCode):
            state = 1
        elif cause_by == any_to_str(RunSubscription):
            state = 2
        if self.rc.state == state:
            self.rc.todo = None
            return False
        self._set_state(state)
        return True

    async def _act(self) -> Message:
        logger.info(f"{self._setting}: to do {self.rc.todo}")
        response = await self.rc.todo.run(self.rc.history)
        if isinstance(response, (ActionOutput, ActionNode)):
            msg = Message(
                content=response.content,
                instruct_content=response.instruct_content,
                role=self._setting,
                cause_by=self.rc.todo,
                sent_from=self,
            )
        elif isinstance(response, Message):
            msg = response
        elif isinstance(response, SubscriptionRunner):
            # RunSubscription handed us the runner: switch to AddSubscriptionTask and run it
            self._set_state(2)
            msg = await self.rc.todo.run(response)
        else:
            msg = Message(content=response, role=self.profile, cause_by=self.rc.todo, sent_from=self)
        self.rc.memory.add(msg)
        return msg

The complete code is as follows:

import sys
from typing import Optional, Any
from uuid import uuid4

from aiocron import crontab
from metagpt.actions import UserRequirement, ActionOutput
from metagpt.actions.action import Action
from metagpt.actions.action_node import ActionNode
from metagpt.logs import logger
from metagpt.roles import Role
from metagpt.schema import Message
from metagpt.subscription import SubscriptionRunner
from metagpt.tools.web_browser_engine import WebBrowserEngine
from metagpt.utils.common import CodeParser, any_to_str
from metagpt.utils.parse_html import _get_soup
from pytz import BaseTzInfo

# Define the ActionNodes first
LANGUAGE = ActionNode(
    key="Language",
    expected_type=str,
    instruction="Provide the language used in the project, typically matching the user's requirement language.",
    example="en_us",
)

CRON_EXPRESSION = ActionNode(
    key="Cron Expression",
    expected_type=str,
    instruction="If the user requires scheduled triggering, please provide the corresponding 5-field cron expression. "
    "Otherwise, leave it blank.",
    example="",
)

CRAWLER_URL_LIST = ActionNode(
    key="Crawler URL List",
    expected_type=list[str],
    instruction="List the URLs user want to crawl. Leave it blank if not provided in the User Requirement.",
    example=["https://example1.com", "https://example2.com"],
)

PAGE_CONTENT_EXTRACTION = ActionNode(
    key="Page Content Extraction",
    expected_type=str,
    instruction="Specify the requirements and tips to extract from the crawled web pages based on User Requirement.",
    example="Retrieve the titles and content of articles published today.",
)

CRAWL_POST_PROCESSING = ActionNode(
    key="Crawl Post Processing",
    expected_type=str,
    instruction="Specify the processing to be applied to the crawled content, such as summarizing today's news.",
    example="Generate a summary of today's news articles.",
)

INFORMATION_SUPPLEMENT = ActionNode(
    key="Information Supplement",
    expected_type=str,
    instruction="If unable to obtain the Cron Expression, prompt the user to provide the time to receive subscription "
    "messages. If unable to obtain the URL List Crawler, prompt the user to provide the URLs they want to crawl. Keep it "
    "blank if everything is clear",
    example="",
)

NODES = [
    LANGUAGE,
    CRON_EXPRESSION,
    CRAWLER_URL_LIST,
    PAGE_CONTENT_EXTRACTION,
    CRAWL_POST_PROCESSING,
    INFORMATION_SUPPLEMENT,
]

PARSE_SUB_REQUIREMENTS_NODE = ActionNode.from_children("ParseSubscriptionReq", NODES)

PARSE_SUB_REQUIREMENT_TEMPLATE = """
### User Requirement
{requirements}
"""

SUB_ACTION_TEMPLATE = """
## Requirements
Answer the question based on the provided context {process}. If the question cannot be answered, please summarize the context.

## context
{data}"
"""

PROMPT_TEMPLATE = """Please complete the web page crawler parse function to achieve the User Requirement. The parse \
function should take a BeautifulSoup object as input, which corresponds to the HTML outline provided in the Context.

```python
from bs4 import BeautifulSoup

# only complete the parse function
def parse(soup: BeautifulSoup):
    ...
    # Return the object that the user wants to retrieve, don't use print
```

## User Requirement
{requirement}

## Context

The outline of html page to scrabe is show like below:

```tree
{outline}
```
"""


# Helper: build an outline view (tags, css classes, text) of the html page
def get_outline(page):
    soup = _get_soup(page.html)
    outline = []

    def process_element(element, depth):
        name = element.name
        if not name:
            return
        if name in ["script", "style"]:
            return

        element_info = {"name": element.name, "depth": depth}

        if name in ["svg"]:
            element_info["text"] = None
            outline.append(element_info)
            return

        element_info["text"] = element.string
        # Check if the element has an "id" attribute
        if "id" in element.attrs:
            element_info["id"] = element["id"]

        if "class" in element.attrs:
            element_info["class"] = element["class"]
        outline.append(element_info)
        for child in element.children:
            process_element(child, depth + 1)

    for element in soup.body.children:
        process_element(element, 1)

    return outline


# Trigger: crontab
class CronTrigger:
    def __init__(self, spec: str, tz: Optional[BaseTzInfo] = None) -> None:
        self.crontab = crontab(spec, tz=tz)

    def __aiter__(self):
        return self

    async def __anext__(self):
        await self.crontab.next()
        return Message()


# Action: write the crawler code
class WriteCrawlerCode(Action):
    async def run(self, requirement):
        requirement: Message = requirement[-1]
        data = requirement.instruct_content.model_dump()
        urls = data["Crawler URL List"]
        query = data["Page Content Extraction"]
        codes = {}
        for url in urls:
            codes[url] = await self._write_code(url, query)
        return "\n".join(f"# {url}\n{code}" for url, code in codes.items())

    async def _write_code(self, url, query):
        page = await WebBrowserEngine().run(url)
        outline = get_outline(page)
        outline = "\n".join(
            f"{' ' * i['depth']}{'.'.join([i['name'], *i.get('class', [])])}: {i['text'] if i['text'] else ''}"
            for i in outline
        )
        code_rsp = await self._aask(PROMPT_TEMPLATE.format(outline=outline, requirement=query))
        code = CodeParser.parse_code(block="", text=code_rsp)
        return code


# Action: parse the subscription requirement
class ParseSubRequirement(Action):
    async def run(self, requirements):
        requirements = "\n".join(i.content for i in requirements)
        context = PARSE_SUB_REQUIREMENT_TEMPLATE.format(requirements=requirements)
        node = await PARSE_SUB_REQUIREMENTS_NODE.fill(context=context, llm=self.llm)
        return node


# Action: build the subscription agent and subscribe it to the runner
class RunSubscription(Action):
    async def run(self, msgs):
        from metagpt.roles.role import Role

        code = msgs[-1].content
        req = msgs[-2].instruct_content.model_dump()
        urls = req["Crawler URL List"]
        process = req["Crawl Post Processing"]
        spec = req["Cron Expression"]
        SubAction = self.create_sub_action_cls(urls, code, process)
        SubRole = type("SubRole", (Role,), {})
        role = SubRole()
        role.name = 'XiaoGang'
        role.profile = 'Crawler'
        role.init_actions([SubAction])
        runner = SubscriptionRunner()

        async def callback(msg):
            print(msg)

        trigger = CronTrigger(spec)
        await runner.subscribe(role, trigger, callback)
        return runner

    @staticmethod
    def create_sub_action_cls(urls: list[str], code: str, process: str):
        modules = {}
        for url in urls[::-1]:
            code, current = code.rsplit(f"# {url}", maxsplit=1)
            name = uuid4().hex
            module = type(sys)(name)
            exec(current, module.__dict__)
            modules[url] = module

        class SubAction(Action):
            async def run(self, *args, **kwargs):
                pages = await WebBrowserEngine().run(*urls)
                if len(urls) == 1:
                    pages = [pages]
                data = []
                for url, page in zip(urls, pages):
                    data.append(getattr(modules[url], "parse")(page.soup))
                return await self.llm.aask(SUB_ACTION_TEMPLATE.format(process=process, data=data))

        return SubAction


# Action: actually run the SubscriptionRunner
class AddSubscriptionTask(Action):
    async def run(self, runner: SubscriptionRunner):
        logger.info("I will run SubscriptionRunner")
        await runner.run()


# Crawler engineer role
class CrawlerEngineer(Role):
    name: str = "John"
    profile: str = "Crawling Engineer"
    goal: str = "Write elegant, readable, extensible, efficient code"
    constraints: str = "The code should conform to standards like PEP8 and be modular and maintainable"

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)
        self._init_actions([WriteCrawlerCode])
        self._watch([ParseSubRequirement])


# Subscription assistant role
class SubscriptionAssistant(Role):
    """Analyze user subscription requirements."""

    name: str = "Grace"
    profile: str = "Subscription Assistant"
    goal: str = "analyze user subscription requirements to provide personalized subscription services."
    constraints: str = "utilize the same language as the User Requirement"

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)
        self._init_actions([ParseSubRequirement, RunSubscription, AddSubscriptionTask])
        self._watch([UserRequirement, WriteCrawlerCode])

    async def _think(self) -> bool:
        cause_by = self.rc.history[-1].cause_by
        if cause_by == any_to_str(UserRequirement):
            state = 0
        elif cause_by == any_to_str(WriteCrawlerCode):
            state = 1
        elif cause_by == any_to_str(RunSubscription):
            state = 2
        if self.rc.state == state:
            self.rc.todo = None
            return False
        self._set_state(state)
        return True

    async def _act(self) -> Message:
        logger.info(f"{self._setting}: to do {self.rc.todo}")
        response = await self.rc.todo.run(self.rc.history)
        if isinstance(response, (ActionOutput, ActionNode)):
            msg = Message(
                content=response.content,
                instruct_content=response.instruct_content,
                role=self._setting,
                cause_by=self.rc.todo,
                sent_from=self,
            )
        elif isinstance(response, Message):
            msg = response
        elif isinstance(response, SubscriptionRunner):
            self._set_state(2)
            msg = await self.rc.todo.run(response)
        else:
            msg = Message(content=response, role=self.profile, cause_by=self.rc.todo, sent_from=self)
        self.rc.memory.add(msg)
        return msg


if __name__ == "__main__":
    import asyncio
    from metagpt.team import Team

    team = Team()
    team.hire([SubscriptionAssistant(), CrawlerEngineer()])
    team.run_project(
        "从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在早上10:56送给我"
    )
    asyncio.run(team.run(5))

To recap: we split AddSubscriptionTask out of RunSubscription and let the SubscriptionRunner run on its own; the interaction flow after the change is shown in the figure below.


[Figure: interaction flow diagram]

The run log is shown below (DEBUG logging was enabled to show the details, and the SubscriptionRunner trigger was temporarily changed to "* * * * *" so the result appears immediately):

    2024-01-27 22:46:49.322 | INFO     | metagpt.const:get_metagpt_package_root:32 - Package root set to D:\workspace\sourcecode\MetaGPT
    2024-01-27 22:46:49.487 | INFO     | metagpt.config:get_default_llm_provider_enum:124 - LLMProviderEnum.AZURE_OPENAI Model: gpt-35-turbo-1106
    2024-01-27 22:46:49.488 | INFO     | metagpt.config:get_default_llm_provider_enum:126 - API: LLMProviderEnum.AZURE_OPENAI
    2024-01-27 22:46:49.488 | DEBUG    | metagpt.config:_ensure_workspace_exists:228 - WORKSPACE_PATH set to D:\workspace\sourcecode\MetaGPT\workspace
    2024-01-27 22:46:49.488 | DEBUG    | metagpt.config:__init__:85 - Config loading done.
    2024-01-27 22:46:53.119 | INFO     | metagpt.config:get_default_llm_provider_enum:124 - LLMProviderEnum.AZURE_OPENAI Model: gpt-35-turbo-1106
    2024-01-27 22:46:53.120 | INFO     | metagpt.config:get_default_llm_provider_enum:126 - API: LLMProviderEnum.AZURE_OPENAI
    2024-01-27 22:46:53.162 | INFO     | metagpt.config:get_default_llm_provider_enum:124 - LLMProviderEnum.AZURE_OPENAI Model: gpt-35-turbo-1106
    2024-01-27 22:46:53.162 | INFO     | metagpt.config:get_default_llm_provider_enum:126 - API: LLMProviderEnum.AZURE_OPENAI
    2024-01-27 22:46:53.190 | DEBUG    | metagpt.environment:publish_message:108 - publish_message: {"id":"6a191b4bc60448acaad2228496ea802e","content":"从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在早上10:56送给我","role":"Human","cause_by":"metagpt.actions.add_requirement.UserRequirement","sent_from":"","send_to":["<all>"]}
    2024-01-27 22:46:53.192 | DEBUG    | metagpt.team:run:131 - max n_round=4 left.
    2024-01-27 22:46:53.192 | DEBUG    | metagpt.roles.role:_observe:400 - Grace(Subscription Assistant) observed: ['Human: 从36kr创投平台https://pit...']
    2024-01-27 22:46:53.193 | DEBUG    | metagpt.roles.role:_set_state:292 - actions=[ParseSubRequirement, RunSubscription, AddSubscriptionTask], state=0
    2024-01-27 22:46:53.193 | DEBUG    | metagpt.roles.role:_react:431 - Grace(Subscription Assistant): self.rc.state=0, will do ParseSubRequirement
    2024-01-27 22:46:53.194 | INFO     | __main__:_act:294 - Grace(Subscription Assistant): to do ParseSubRequirement
    2024-01-27 22:46:53.218 | DEBUG    | metagpt.roles.role:run:482 - John(Crawling Engineer): no news. waiting.
    [CONTENT]
    {
        "Language": "zh_cn",
        "Cron Expression": "56 10 * * *",
        "Crawler URL List": [
            "https://pitchhub.36kr.com/financing-flash"
        ],
        "Page Content Extraction": "获取今天发布的所有文章的标题和内容。",
        "Crawl Post Processing": "生成今天新闻文章的摘要。",
        "Information Supplement": ""
    }
    [/CONTENT]
    2024-01-27 22:46:55.621 | INFO     | metagpt.utils.cost_manager:update_cost:48 - Total running cost: .001 | Max budget: .000 | Current cost: .001, prompt_tokens: 493, completion_tokens: 100
    2024-01-27 22:46:55.622 | DEBUG    | metagpt.actions.action_node:_aask_v1:269 - llm raw output:
    [CONTENT]
    {
        "Language": "zh_cn",
        "Cron Expression": "56 10 * * *",
        "Crawler URL List": [
            "https://pitchhub.36kr.com/financing-flash"
        ],
        "Page Content Extraction": "获取今天发布的所有文章的标题和内容。",
        "Crawl Post Processing": "生成今天新闻文章的摘要。",
        "Information Supplement": ""
    }
    [/CONTENT]
    2024-01-27 22:46:55.626 | DEBUG    | metagpt.actions.action_node:_aask_v1:279 - parsed_data:
    {'Language': 'zh_cn', 'Cron Expression': '56 10 * * *', 'Crawler URL List': ['https://pitchhub.36kr.com/financing-flash'], 'Page Content Extraction': '获取今天发布的所有文章的标题和内容。', 'Crawl Post Processing': '生成今天新闻文章的摘要。', 'Information Supplement': ''}
    2024-01-27 22:46:55.626 | DEBUG    | metagpt.roles.role:_set_state:292 - actions=[ParseSubRequirement, RunSubscription, AddSubscriptionTask], state=-1
    2024-01-27 22:46:55.628 | DEBUG    | metagpt.environment:publish_message:108 - publish_message: {"id":"561cbf7b752040ceb51d46b2d3782fae","content":"[CONTENT]\n{\n    \"Language\": \"zh_cn\",\n    \"Cron Expression\": \"56 10 * * *\",\n    \"Crawler URL List\": [\n        \"https://pitchhub.36kr.com/financing-flash\"\n    ],\n    \"Page Content Extraction\": \"获取今天发布的所有文章的标题和内容。\",\n    \"Crawl Post Processing\": \"生成今天新闻文章的摘要。\",\n    \"Information Supplement\": \"\"\n}\n[/CONTENT]","instruct_content":{"class":"ParseSubscriptionReq_AN","mapping":{"Language":"(<class 'str'>, Ellipsis)","Cron Expression":"(<class 'str'>, Ellipsis)","Crawler URL List":"(list[str], Ellipsis)","Page Content Extraction":"(<class 'str'>, Ellipsis)","Crawl Post Processing":"(<class 'str'>, Ellipsis)","Information Supplement":"(<class 'str'>, Ellipsis)"},"value":{"Language":"zh_cn","Cron Expression":"56 10 * * *","Crawler URL List":["https://pitchhub.36kr.com/financing-flash"],"Page Content Extraction":"获取今天发布的所有文章的标题和内容。","Crawl Post Processing":"生成今天新闻文章的摘要。","Information Supplement":""}},"role":"Grace(Subscription Assistant)","cause_by":"__main__.ParseSubRequirement","sent_from":"__main__.SubscriptionAssistant","send_to":["<all>"]}
    2024-01-27 22:46:55.629 | DEBUG    | metagpt.environment:run:132 - is idle: False
    2024-01-27 22:46:55.629 | DEBUG    | metagpt.team:run:131 - max n_round=3 left.
    2024-01-27 22:46:55.630 | DEBUG    | metagpt.roles.role:run:482 - Grace(Subscription Assistant): no news. waiting.
    2024-01-27 22:46:55.630 | DEBUG    | metagpt.roles.role:_observe:400 - John(Crawling Engineer) observed: ['Grace(Subscription Assistant): [CONTENT]\n{\n    "Lan...']
    2024-01-27 22:46:55.630 | DEBUG    | metagpt.roles.role:_set_state:292 - actions=[WriteCrawlerCode], state=0
    2024-01-27 22:46:55.631 | DEBUG    | metagpt.roles.role:_react:431 - John(Crawling Engineer): self.rc.state=0, will do WriteCrawlerCode
    2024-01-27 22:46:55.631 | INFO     | metagpt.roles.role:_act:360 - John(Crawling Engineer): to do WriteCrawlerCode(WriteCrawlerCode)
    To achieve the user requirement of retrieving the titles and content of all articles published today, we can use the following parse function:
    
    ```python
    from bs4 import BeautifulSoup
    
    def parse(soup: BeautifulSoup):
        articles = soup.find_all('div', class_='css-xle9x')  # Find all divs containing articles
        today_articles = []  # List to store today's articles
    
        for article in articles:
            time = article.find('span', class_='time').get_text()  # Get the time of the article
            if '小时前' in time or '昨天' in time:  # Check if the article was published today
                title = article.find('a', class_='title').get_text()  # Get the title of the article
                content = article.find('div', class_='item-desc').get_text()  # Get the content of the article
                today_articles.append({'title': title, 'content': content})  # Add the article to the list
    
        return today_articles  # Return the list of today's articles
    ```
    
    This parse function first finds all the divs containing articles using the `find_all` method. Then, it iterates through each article, extracts the time, title, and content of the article, and checks if the article was published today. If the article was published today, it adds the title and content to a list. Finally, it returns the list of today's articles.
    
    This implementation follows PEP8 standards, is modular, and easy to read and maintain. It efficiently retrieves the required information from the HTML page using BeautifulSoup.
    2024-01-27 22:47:06.120 | INFO     | metagpt.utils.cost_manager:update_cost:48 - Total running cost: .009 | Max budget: .000 | Current cost: .009, prompt_tokens: 7949, completion_tokens: 334
    2024-01-27 22:47:06.120 | DEBUG    | metagpt.roles.role:_set_state:292 - actions=[WriteCrawlerCode], state=-1
    2024-01-27 22:47:06.121 | DEBUG    | metagpt.environment:publish_message:108 - publish_message: {"id":"a312de368d2d402a941ca400ca46d16b","content":"# https://pitchhub.36kr.com/financing-flash\nfrom bs4 import BeautifulSoup\n\ndef parse(soup: BeautifulSoup):\n    articles = soup.find_all('div', class_='css-xle9x')  # Find all divs containing articles\n    today_articles = []  # List to store today's articles\n\n    for article in articles:\n        time = article.find('span', class_='time').get_text()  # Get the time of the article\n        if '小时前' in time or '昨天' in time:  # Check if the article was published today\n            title = article.find('a', class_='title').get_text()  # Get the title of the article\n            content = article.find('div', class_='item-desc').get_text()  # Get the content of the article\n            today_articles.append({'title': title, 'content': content})  # Add the article to the list\n\n    return today_articles  # Return the list of today's articles\n","role":"Crawling Engineer","cause_by":"__main__.WriteCrawlerCode","sent_from":"__main__.CrawlerEngineer","send_to":["<all>"]}
    2024-01-27 22:47:06.121 | DEBUG    | metagpt.environment:run:132 - is idle: False
    2024-01-27 22:47:06.121 | DEBUG    | metagpt.team:run:131 - max n_round=2 left.
    2024-01-27 22:47:06.122 | DEBUG    | metagpt.roles.role:_observe:400 - Grace(Subscription Assistant) observed: ['Crawling Engineer: # https://pitchhub.3...']
    2024-01-27 22:47:06.122 | DEBUG    | metagpt.roles.role:_set_state:292 - actions=[ParseSubRequirement, RunSubscription, AddSubscriptionTask], state=1
    2024-01-27 22:47:06.122 | DEBUG    | metagpt.roles.role:_react:431 - Grace(Subscription Assistant): self.rc.state=1, will do RunSubscription
    2024-01-27 22:47:06.122 | INFO     | __main__:_act:294 - Grace(Subscription Assistant): to do RunSubscription
    2024-01-27 22:47:06.124 | DEBUG    | __main__:create_sub_action_cls:228 - url='https://pitchhub.36kr.com/financing-flash'
    module=<module 'e4e9a953132c4b4eb57d92a5f2b1a3de'>
    2024-01-27 22:47:06.136 | INFO     | metagpt.config:get_default_llm_provider_enum:124 - LLMProviderEnum.AZURE_OPENAI Model: gpt-35-turbo-1106
    2024-01-27 22:47:06.136 | INFO     | metagpt.config:get_default_llm_provider_enum:126 - API: LLMProviderEnum.AZURE_OPENAI
    2024-01-27 22:47:06.253 | DEBUG    | metagpt.roles.role:_set_state:292 - actions=[ParseSubRequirement, RunSubscription, AddSubscriptionTask], state=2
    2024-01-27 22:47:06.253 | INFO     | __main__:run:246 - I will run SubscriptionRunner
    2024-01-27 22:47:06.254 | DEBUG    | metagpt.roles.role:run:482 - John(Crawling Engineer): no news. waiting.
    2024-01-27 22:48:00.017 | DEBUG    | metagpt.roles.role:_observe:400 - () observed: ['user: ...']
    2024-01-27 22:48:00.017 | DEBUG    | metagpt.roles.role:_set_state:292 - actions=[SubAction], state=0
    2024-01-27 22:48:00.018 | DEBUG    | metagpt.roles.role:_react:431 - (): self.rc.state=0, will do SubAction
    2024-01-27 22:48:00.018 | INFO     | metagpt.roles.role:_act:360 - (): to do SubAction(SubAction)
    根据提供的上下文,今天的新闻摘要包括以下内容:
    
    1. Synnovation Therapeutics获得1.02亿美元A轮融资,该公司正在开发小分子抗癌疗法。
    
    2. 印度人工智能创企Krutrim完成5000万美元融资,成为印度首家AI独角兽。
    
    3. 深圳「潜行创新」完成数千万元B+轮融资,深圳「易新能」完成数千万元首轮融资。
    
    4. 光印网络科技完成Pre-A轮融资,资金将用于加强生活服务达人资源拓展、以AI为基础的数字化营销能力建设等。
    
    5. 睿普康完成过亿元A轮融资,专注于卫星通信、蜂窝通信及电源管理芯片研发。
    
    6. Accent Therapeutics获得7500万美元C轮融资,研发新型小分子精准癌症疗法。
    
    7. HEPHAISTOS-Pharma获得200万欧元种子轮融资,研发下一代癌症免疫疗法。
    
    8. Elephas Biosciences Corporation获得5500万美元C轮融资,开发肿瘤成像诊断平台。
    
    9. 零跑汽车不再续聘吴保军,一周前刚获得6.59亿港元融资。
    
    10. 「易新能」完成数千万元首轮融资,资金将主要用于新增产能所需的厂房、设备等项目的资本开支和配套流动资金,加速扩张储能、充电桩、数据等领域。
    
    这些新闻涵盖了医疗科技、人工智能、网络科技和汽车行业的融资和发展动态。
    : 根据提供的上下文,今天的新闻摘要包括以下内容:
    
    1. Synnovation Therapeutics获得1.02亿美元A轮融资,该公司正在开发小分子抗癌疗法。
    
    2. 印度人工智能创企Krutrim完成5000万美元融资,成为印度首家AI独角兽。
    
    3. 深圳「潜行创新」完成数千万元B+轮融资,深圳「易新能」完成数千万元首轮融资。
    
    4. 光印网络科技完成Pre-A轮融资,资金将用于加强生活服务达人资源拓展、以AI为基础的数字化营销能力建设等。
    
    5. 睿普康完成过亿元A轮融资,专注于卫星通信、蜂窝通信及电源管理芯片研发。
    
    6. Accent Therapeutics获得7500万美元C轮融资,研发新型小分子精准癌症疗法。
    
    7. HEPHAISTOS-Pharma获得200万欧元种子轮融资,研发下一代癌症免疫疗法。
    
    8. Elephas Biosciences Corporation获得5500万美元C轮融资,开发肿瘤成像诊断平台。
    
    9. 零跑汽车不再续聘吴保军,一周前刚获得6.59亿港元融资。
    
    10. 「易新能」完成数千万元首轮融资,资金将主要用于新增产能所需的厂房、设备等项目的资本开支和配套流动资金,加速扩张储能、充电桩、数据等领域。
    
    这些新闻涵盖了医疗科技、人工智能、网络科技和汽车行业的融资和发展动态。
    


