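Using LangChain to call GPT for bilingual alignment

The crux of using LangChain to call ChatGPT for text alignment is how the prompt is written. Once I had learned about formatted output and output parsing, it occurred to me that GPT could be used for alignment.

No sooner said than done!!! 💪

For the basics of LangChain I followed Andrew Ng's online course. It is very clear, and it comes with a Jupyter notebook in the browser that you can run alongside, which spares you the pain of setting up an environment. Whenever I ran things on my own machine, though, something always went wrong 😭

The principle of aligning with GPT is fairly simple to state (though quite a few problems came up in practice):

Step 1: read the Word document
Step 2: call the API and get formatted output
Step 3: write the result to an Excel file.

The hard part is step 2:

Call GPT to do the alignment.
Use ResponseSchema and StructuredOutputParser to fix the format of GPT's output and to parse it.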
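
To make the parsing half of this step concrete, here is a minimal, stdlib-only sketch of what happens to a reply once the model follows the format instructions: the JSON inside the fenced json block is pulled out and loaded into a dict. This is only an illustration of the idea, not LangChain's actual implementation; `parse_fenced_json` and the sample reply are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # three backticks, i.e. the markdown code fence

def parse_fenced_json(reply):
    """Extract the first fenced json block from a model reply and load it."""
    pattern = FENCE + r"json\s*(.*?)\s*" + FENCE
    match = re.search(pattern, reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no fenced json block found in the reply")
    return json.loads(match.group(1))

# Hypothetical reply, shaped like the output the format instructions ask for
sample_reply = (
    FENCE + "json\n"
    '{"Chinese_sentences": ["你好。", "再见。"], '
    '"English_sentences": ["Hello.", "Goodbye."]}\n'
    + FENCE
)
parsed = parse_fenced_json(sample_reply)
print(parsed["English_sentences"])
```

If the model ignores the requested format, the function raises instead of returning partial data, which is usually what you want before writing anything to Excel.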

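First attempt

As I saw it, alignment takes two steps: first split the text into sentences, then align them.

At first I had GPT do both steps, using the prompt to make it split the sentences and then align them. My first prompt was as follows: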

template = """I want you to act as a translator who is good at \
English and Chinese. I will give you a text in which there is one Chinese \
paragraph and one English paragraph. I need you to separate the English \
paragraph into sentences, and also separate the Chinese paragraph into \
sentences according to the meaning of each English sentence.

Remember not to change any word of the paragraphs, only separate them.

{format_instructions}

text: {text}"""

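At this point format_instructions looked like this. My first idea was to put the sentences into two lists. I did consider JSON at the start as well, but I went with putting them directly into two lists: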

Chinese_sentences = ResponseSchema(name="Chinese_sentences", type='list', description='add each Chinese sentence into this list')
English_sentences = ResponseSchema(name="English_sentences", type='list', description='add each English sentence into this list')
response_schemas = [Chinese_sentences, English_sentences]
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions = output_parser.get_format_instructions()

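That was the prompt I fed in. My initial idea was simply to model the prompt on the approach of the classmate mentioned earlier. The results from this prompt were quite good, with essentially 100% accuracy.

But problems appeared later, once I changed the text to be aligned.

My original need was to align an English-to-Chinese translation of my own. The English source was an assignment from my supervisor, who asked for an Excel file with the text aligned sentence by sentence once the translation was done. Because I had used GPT for post-editing the translation, I had already had it place each translated Chinese sentence right after its English sentence, so my text was fairly neatly formatted to begin with. That is why the Excel output was aligned so well when I first used this prompt.

Then a classmate asked me to try it on ordinary text in which the Chinese and the English are kept separate, and that is where the problems started.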

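Second attempt

With the same kind of prompt, I found that GPT could not fully split the text into sentences. I had GPT take the Chinese as the basis and align the English to it, but it could not align the Chinese text completely sentence by sentence. I rewrote the prompt many times, along the lines of the one below and several others; in short, the instructions were already very detailed. (The prompt below tells GPT to split both texts into sentences and number them, then to check whether a neighbouring sentence's meaning is already covered by the sentence on the other side, merging sentences when it is, and so on through the whole text.)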

template = """你现在是一个中英文本对齐师，现在你需要将中英文分句后按照句意对齐，办法是这样：\
    首先将中英文分句并分别排序，然后拿出中文的第一句，检测英文第2句句意是否在中文的第一句中。\
    如有，则将英文的1、2句一起与中文的第一句对齐，如没有，则只将英文的第一句与中文的第一句对齐；\
    然后检测中文的第二句句意是否在英文的第一句中。如有，则将中文的1、2句一起与英文的第一句对齐，如没有，\
    则只将中文的第一句与英文的第一句对齐，以此类推，将我发给你的所有文本按照句意对齐。

{format_instructions}

text: {text}"""

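I also tried these prompts in the ChatGPT web interface, where they worked: the sentences came out finely split and the accuracy was fairly high. For some reason, though, the results through the API were not good. My guess was that format_instructions might be having some effect.

At first I was still using the earlier ResponseSchema to fix the output, so I suspected that asking GPT to put the result into two lists was interfering with the alignment. I therefore decided to have it output each Chinese-English sentence pair as a key-value pair inside a single JSON object. format_instructions then became the following: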

aligned_sentences = ResponseSchema(name="aligned_sentences",
    description='add each Chinese and English sentence pair into here as a key-value pair')
response_schemas = [aligned_sentences]

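With this change the output was somewhat better, but GPT still could not split the text fully into sentences. It fell short of what I needed and did not even split as well as ABBYY Aligner.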
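
One caveat of the key-value layout is worth illustrating: JSON object keys are unique, so if the same Chinese sentence occurs twice in a text, the later pair overwrites the earlier one when the reply is parsed. Below is a minimal sketch with made-up data. The list-of-pairs layout is just one possible alternative, not something the code in this post does.

```python
import json

# Made-up reply in which the same Chinese sentence appears twice as a key
as_object = '{"谢谢。": "Thank you.", "你好。": "Hello.", "谢谢。": "Thanks."}'
pairs_from_object = list(json.loads(as_object).items())
print(pairs_from_object)  # the duplicated key keeps only its last value

# The same content as a list of [Chinese, English] pairs keeps every row
as_list = '[["谢谢。", "Thank you."], ["你好。", "Hello."], ["谢谢。", "Thanks."]]'
pairs_from_list = [tuple(pair) for pair in json.loads(as_list)]
print(pairs_from_list)
```

For speeches and similar texts with repeated short sentences, this silent loss of rows is easy to miss in the final Excel file.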

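Third attempt

I started thinking about how to write a prompt that would make GPT strictly split a paragraph into sentences. I even told it to split the Chinese at punctuation marks, but it still could not do it. While writing about punctuation, it suddenly struck me: if all I want is the Chinese split into sentences, why not split it by some other means and then have GPT find the English for each Chinese sentence in the English text? That seemed feasible. I planned to use jieba for the splitting and asked GPT how to do it; it simply gave me the following code, which relies on a regular expression: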

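import re

def split_sentences(text):
    # Split on sentence-ending punctuation; the capture group keeps the marks
    components = re.split('([。！？.!?:：])', text)
    # Re-attach each sentence to the mark after it; a trailing fragment with no mark is kept too
    sentences = ["".join(components[i:i+2]) for i in range(0, len(components), 2)]
    return [s for s in sentences if s]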
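
Because this splitter treats `.` and `:` as sentence boundaries too, a number such as 3.5 or a time such as 10:30 inside the Chinese text gets cut in half. Since the function is only applied to the Chinese side here, one possible variant (my own sketch, not the code GPT suggested) splits only after full-width Chinese terminators. `split_chinese_sentences` is a hypothetical name.

```python
import re

def split_chinese_sentences(text):
    # Zero-width split right after a Chinese sentence terminator, so the
    # punctuation stays attached to its sentence (needs Python 3.7+).
    parts = re.split(r'(?<=[。！？])', text)
    # Drop empty strings and surrounding whitespace
    return [p.strip() for p in parts if p.strip()]

print(split_chinese_sentences("增长率为3.5%。会议于10:30开始！谢谢"))
```

If headings that end in a full-width colon should stay separate rows, as in the output shown later in this post, `：` can be added to the character class.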

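Next I went on revising the prompt. The new version tells GPT that it will receive a list of Chinese sentences plus an English paragraph, and that for each Chinese sentence in turn it should find the matching English in the paragraph, split it out, and pair the two: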

template = """你现在是一个中英文本对齐师，你将会得到一个含有一个个中文句子的列表和一个英文段落，现在你需要将这一个个中文句子的英文翻译在英文段落中找出来\
并将它们对齐为一个句子对。比如，你先从取第一个中文句子，按照这个中文句子的句意，从英文段落开头开始匹配，将和这句中文句子相匹配的英文句子拆分出来，同这句中文句子对齐。\
然后将这一个中文句子和英文句子对齐为一个句子对即可。依次对每个中文句子都执行这一对齐操作。直到对齐完所有的中文句子。

{format_instructions}

text: {text}"""

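The final result was very satisfying: the output was more accurate than Aligner's, and GPT was able to locate the English for every Chinese sentence in the text. Here is part of the aligned output: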

```json
{
    "aligned_sentences": {
        "第二，我们必须复苏经济，推动实现更加强劲、绿色、健康的全球发展。": "Second, we must revitalize the economy and promote a more robust, green, and healthy global development.",
        "发展是实现人民幸福的关键。": "Development is the key to achieving the well-being of the people.",
        "面对疫情带来的严重冲击，我们要共同推动全球发展迈向平衡协调包容新阶段。": "Faced with the severe impact of the pandemic, we must jointly propel global development towards a new stage of balance, coordination, and inclusiveness.",
        "在此，我愿提出全球发展倡议：": "Here, I propose a Global Development Initiative:",
        "——坚持发展优先。": "- Adhere to development priority:",
        "将发展置于全球宏观政策框架的突出位置，加强主要经济体政策协调，保持连续性、稳定性、可持续性，构建更加平等均衡的全球发展伙伴关系，推动多边发展合作进程协同增效，加快落实联合国2030年可持续发展议程。": "Place development in a prominent position within the global macro-policy framework, strengthen policy coordination among major economies, maintain continuity, stability, and sustainability, build a more equal and balanced global development partnership, promote the coordinated and efficient progress of multilateral development cooperation processes, and accelerate the implementation of the United Nations 2030 Agenda for Sustainable Development.",
        "——坚持以人民为中心。": "- Adhere to people-centered development:",
        "在发展中保障和改善民生，保护和促进人权，做到发展为了人民、发展依靠人民、发展成果由人民共享，不断增强民众的幸福感、获得感、安全感，实现人的全面发展。": "In the course of development, ensure and improve people's well-being, protect and promote human rights, ensure that development is for the people, relies on the people, and the benefits of development are shared by the people, constantly enhance the happiness, sense of gain, and sense of security of the people, and achieve comprehensive human development.",
        "——坚持普惠包容。": "- Adhere to inclusive development:",
        "关注发展中国家特殊需求，通过缓债、发展援助等方式支持发展中国家尤其是困难特别大的脆弱国家，着力解决国家间和各国内部发展不平衡、不充分问题。": "Pay attention to the special needs of developing countries, support developing countries, especially those facing particular difficulties, with debt relief, development assistance, and other means, focus on addressing the issues of imbalances and inadequacies in development among and within countries.",
        "——坚持创新驱动。": "- Adhere to innovation-driven development:",
        "抓住新一轮科技革命和产业变革的历史性机遇，加速科技成果向现实生产力转化，打造开放、公平、公正、非歧视的科技发展环境，挖掘疫后经济增长新动能，携手实现跨越发展。": "Seize the historic opportunities of the new round of technological revolution and industrial transformation, accelerate the transformation of scientific achievements into real productivity, create an open, fair, just, and non-discriminatory environment for technological development, explore new impetus for post-pandemic economic growth, and work together to achieve leapfrog development.",
        "——坚持人与自然和谐共生。": "- Adhere to harmonious coexistence between humans and nature:",
        "完善全球环境治理，积极应对气候变化，构建人与自然生命共同体。": "Improve global environmental governance, actively address climate change, and build a community of life for all living things.",
        "加快绿色低碳转型，实现绿色复苏发展。": "Accelerate the green and low-carbon transformation, achieve green recovery and development.",
  ......}
}
```

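I even tried deleting some of the English sentences from the text to see what would come out. It turned out that GPT translates the missing sentence itself to fill the gap, although the alignment gets worse if too many English sentences are removed.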
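
A cheap way to catch this behavior after the fact is to check every returned English sentence against the source paragraph. The sketch below flags pairs whose English side does not occur verbatim (after whitespace normalization) in the original. `find_invented_sentences` and the sample data are hypothetical and not part of my program.

```python
def normalize(text):
    # Collapse runs of whitespace so line wrapping does not cause false alarms
    return " ".join(text.split())

def find_invented_sentences(aligned, english_source):
    """Return the (Chinese, English) pairs whose English is not in the source."""
    source = normalize(english_source)
    return [(zh, en) for zh, en in aligned.items() if normalize(en) not in source]

english_source = "Development is the key. We must act together."
aligned = {
    "发展是关键。": "Development is the key.",
    "我们必须共同行动。": "We must act together.",
    "谢谢大家。": "Thank you all.",  # not in the source: supplied by the model
}
print(find_invented_sentences(aligned, english_source))
```

The same substring check could be run on the Chinese keys against the Chinese input to see whether GPT altered the source side.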

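Overall, I think aligning with GPT is entirely feasible; it just needs continued refinement. At this point there are still some problems I have not solved:

If the uploaded document is too long, it exceeds the token limit. I have not yet thought of a solution.
How to ensure GPT does not modify the source text. I think this can be improved further through the prompt.
A single Chinese sentence is sometimes long and contains several sense groups. How can it be split further?

These are the problems I can think of for now.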
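
For the token limit, one possible direction is to send the document in pieces rather than all at once. The sketch below assumes the Chinese and English texts have the same number of paragraphs in the same order, which is not guaranteed for every document, and groups paragraph pairs into batches under a character budget. Characters are only a rough stand-in for tokens, and `batch_paragraph_pairs` and `max_chars` are hypothetical names.

```python
def batch_paragraph_pairs(zh_paragraphs, en_paragraphs, max_chars=2000):
    """Group (Chinese, English) paragraph pairs into batches of limited size."""
    if len(zh_paragraphs) != len(en_paragraphs):
        raise ValueError("this sketch assumes one English paragraph per Chinese paragraph")
    batches, current, current_size = [], [], 0
    for zh, en in zip(zh_paragraphs, en_paragraphs):
        size = len(zh) + len(en)
        # Start a new batch when adding this pair would exceed the budget
        if current and current_size + size > max_chars:
            batches.append(current)
            current, current_size = [], 0
        current.append((zh, en))
        current_size += size
    if current:
        batches.append(current)
    return batches

zh = ["第一段。", "第二段。", "第三段。"]
en = ["Paragraph one.", "Paragraph two.", "Paragraph three."]
batches = batch_paragraph_pairs(zh, en, max_chars=40)
for batch in batches:
    print(batch)
```

Each batch would then go through the same prompt in its own API call, and the resulting sentence pairs would be concatenated before writing the Excel file.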

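A few reflections

Still, I feel I learned a lot from writing this small program. Learn by doing, think while doing: as you go, you notice shortcomings, find problems, and gradually improve things. I think starting from a real need is what gives you the motivation and lets you solve actual problems. In the end I wrapped the program in a bare-bones web page with streamlit that handles drag-and-drop upload and automatic download of the Excel file. Whether or not the program is polished, it at least meets my original need.

All of my code is below for reference. If you have good ideas, let's discuss 😀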

import os
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.chat_models import ChatOpenAI
import openai
import docx
import re
import pandas as pd
from itertools import zip_longest
import json

# load the document
doc = docx.Document('21-习近平 机器翻译 AI.docx')
chinese_paragraphs = []
english_paragraphs = []

for paragraph in doc.paragraphs:
    text = paragraph.text.strip()
    if re.search('[\u4e00-\u9fff]', text):  # paragraph contains Chinese characters
        chinese_paragraphs.append(text)
    else:
        english_paragraphs.append(text)

# join the paragraphs of each language into one block of text
chinese_paragraphs = ''.join(chinese_paragraphs)
english_paragraphs = ' '.join(english_paragraphs)  # the space keeps adjacent English paragraphs from running together

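def split_sentences(text):
    # Split on sentence-ending punctuation; the capture group keeps the marks
    components = re.split('([。！？.!?:：])', text)
    # Re-attach each sentence to the mark after it; a trailing fragment with no mark is kept too
    sentences = ["".join(components[i:i+2]) for i in range(0, len(components), 2)]
    return [s for s in sentences if s]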

# connect the Chinese sentence list and the English paragraph as input text
Chinese_sentences = split_sentences(chinese_paragraphs)
input_text = str(Chinese_sentences)+'\n'+english_paragraphs



# load the api_key
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']


# account for deprecation of LLM model
import datetime
current_date = datetime.datetime.now().date()
target_date = datetime.date(2024, 6, 12)
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"

# structured output template
aligned_sentences= ResponseSchema(name="aligned_sentences", 
    description = 'add each Chiense and English sentence pair into here as a key-value pair')
response_schemas = [aligned_sentences]

# output_parser to parse gpt's response
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

# format the structured output template
format_instructions = output_parser.get_format_instructions()
format_instructions

# prompt template
template = """你现在是一个中英文本对齐师，你将会得到一个含有一个个中文句子的列表和一个英文段落，现在你需要将这一个个中文句子的英文翻译在英文段落中找出来\
并将它们对齐为一个句子对。比如，你先从取第一个中文句子，按照这个中文句子的句意，从英文段落开头开始匹配，将和这句中文句子相匹配的英文句子拆分出来，同这句中文句子对齐。\
然后将这一个中文句子和英文句子对齐为一个句子对即可。依次对每个中文句子都执行这一对齐操作。直到对齐完所有的中文句子。

{format_instructions}

text: {text}"""


align_prompt_template = ChatPromptTemplate.from_template(template)

# detailed prompt
align_prompt = align_prompt_template.format_messages(text = input_text, format_instructions=format_instructions)

#call gpt
chat = ChatOpenAI(temperature = 0.7, model = llm_model)

response = chat(align_prompt)

# use parser to parse the structured output
output_dict = output_parser.parse(response.content)['aligned_sentences']
type(output_dict)

# convert json into DataFrame with keys and values as two columns
df = pd.DataFrame.from_dict(output_dict, orient='index').reset_index()

# name two columns
df.columns = ['中文', '英文']

# write DataFrame into a Excel file
df.to_excel("output_1.xlsx", index=False)

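Below is the code for the streamlit page:

My earlier prompt fully met my needs, and the documents I need to align do not exceed the token limit in either language, so I used the earlier prompt template here.

The streamlit code mainly polishes a few small details:

Name the Excel file automatically
Show the extracted Chinese and English text on the page
Use zip_longest so that empty Excel cells do not raise an error
Add a proxy for access
Allow different api_keys to be supplied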
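
The zip item refers to `itertools.zip_longest`: when GPT returns two lists of different lengths, it pads the shorter one with `None` instead of silently dropping the extra rows, as the built-in `zip` would. A tiny illustration with made-up lists:

```python
from itertools import zip_longest

english = ["Hello.", "Thank you.", "Goodbye."]
chinese = ["你好。", "谢谢。"]  # one sentence short

rows = list(zip_longest(english, chinese))
print(rows)
```

The padded `None` ends up as an empty cell in the Excel file, which also makes it easy to spot where the two lists went out of step.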

import os
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.chat_models import ChatOpenAI
import openai
from docx import Document
import re
import pandas as pd
from itertools import zip_longest
import streamlit as st
from dotenv import load_dotenv, find_dotenv

st.set_page_config(page_title="Bilingual Alignment", page_icon="📖")
st.markdown("# Bilingual Alignment")
st.sidebar.header("Bilingual Alignment")
st.write(
    """
    You can upload your bilingual doc (zh_CN or en_US).
    This app will help you align the Chinese and English text and then output an Excel file.
"""
)

os.environ['HTTP_PROXY'] = 'http:***'
os.environ['HTTPS_PROXY'] = 'http:***'

# account for deprecation of LLM model
import datetime
current_date = datetime.datetime.now().date()
target_date = datetime.date(2024, 6, 12)
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"

def extract_paragraphs(docx_file):
    #extract text from the document
    doc = Document(docx_file)
    chinese_paragraphs = []
    english_paragraphs = []

    for paragraph in doc.paragraphs:
        text = paragraph.text.strip()
        if re.search('[\u4e00-\u9fff]', text):  # paragraph contains Chinese characters
            chinese_paragraphs.append(text)
        else:
            english_paragraphs.append(text)
    chinese_paragraphs = ''.join(chinese_paragraphs)
    english_paragraphs = ' '.join(english_paragraphs)  # the space keeps adjacent English paragraphs from running together
    input_text = english_paragraphs + '\n' + chinese_paragraphs
    
    return input_text

def prompt_template():
    # Define the prompt template
    Chinese_sentences = ResponseSchema(name="Chinese_sentences", type='list',
                                        description='add each Chinese sentence into this list')
    English_sentences = ResponseSchema(name="English_sentences", type='list',
                                        description='add each English sentence into this list')
    response_schemas = [Chinese_sentences,English_sentences]
    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
    format_instructions = output_parser.get_format_instructions()
    
    template = """
    你现在是一个中英文本对齐师，现在你需要将中英文分句后按照句意对齐，办法是这样：\
    首先将中英文分句并分别排序，然后拿出中文的第一句，检测英文第2句句意是否在中文的第一句中。\
    如有，则将英文的1、2句一起与中文的第一句对齐，如没有，则只将英文的第一句与中文的第一句对齐；\
    然后检测中文的第二句句意是否在英文的第一句中。如有，则将中文的1、2句一起与英文的第一句对齐，如没有，\
    则只将中文的第一句与英文的第一句对齐，以此类推，将我发给你的所有文本按照句意对齐。

    {format_instructions}

    text: {text}"""
    return template, format_instructions, output_parser

def call_gpt(template,input_text,format_instructions,output_parser):
    _ = load_dotenv(find_dotenv()) # read local .env file
    openai.api_key = os.environ['OPENAI_API_KEY']
    #call gpt to answer  
    align_prompt_template = ChatPromptTemplate.from_template(template)
    align_prompt = align_prompt_template.format_messages(text = input_text,
                                                        format_instructions=format_instructions)
    chat = ChatOpenAI(temperature = 0.7, model = llm_model)
    response = chat(align_prompt)
    print(response.content)
    # Parse the response once and pull out both sentence lists
    parsed = output_parser.parse(response.content)
    Chinese_sentence_list = parsed["Chinese_sentences"]
    English_sentence_list = parsed["English_sentences"]

    return Chinese_sentence_list, English_sentence_list

def write_excel(Chinese_sentence_list, English_sentence_list, base_filename):
    st.write("Writing the Excel file")
    zipped = zip_longest(English_sentence_list, Chinese_sentence_list)
    # create a DataFrame 
    df = pd.DataFrame(zipped, columns = ["英文","中文"])
    # write it into a excel file
    excel_filename = f'{base_filename}_output.xlsx'
    df.to_excel(excel_filename, index=False)


def main():
    uploaded_file = st.file_uploader("Upload a bilingual Chinese-English Word document", type=["docx"])
    if uploaded_file is not None:
        # get the name of the uploaded file
        uploaded_filename = uploaded_file.name
        # remove the extension of the name
        base_filename = os.path.splitext(uploaded_filename)[0]
        input_text = extract_paragraphs(uploaded_file)
        st.write("Chinese and English text")
        st.write(input_text)
        template, format_instructions, output_parser = prompt_template()
        Chinese_sentence_list, English_sentence_list = call_gpt(template,input_text,format_instructions,output_parser)
        write_excel(Chinese_sentence_list, English_sentence_list, base_filename)

if __name__ == "__main__":
    main()