Sumy New Summarization

Let's use the Sumy library to summarize your worknotes. The script below handles three kinds of worknote: HTML content, manually written comments, and comments from automated tasks. It strips the HTML markup, drops sentences that contain unwanted phrases, moves sentences that contain important keywords to the front, and then summarizes the worknotes grouped by task number.

Step-by-Step Script

Install Necessary Libraries:

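pip install pandas
pip install beautifulsoup4
pip install sumy

Sumy's English tokenizer relies on NLTK's punkt tokenizer data. If the script raises a LookupError for it, download the data once (newer NLTK releases may ask for punkt_tab instead):

python -c "import nltk; nltk.download('punkt')"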

Python Code:

import pandas as pd
from bs4 import BeautifulSoup
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

# Sample Data
data = {
    'RTSK Number': ['TASK0118575574', 'TASK0118575574', 'TASK0118575574', 'TASK0118575585', 'TASK0118575585', 'TASK0118575585'],
    'RTSK Descr': ['Capacity Reservation', 'Capacity Reservation', 'Capacity Reservation', 'Add Monitoring - Unix', 'Add Monitoring - Unix', 'Add Monitoring - Unix'],
    'RTSK Worknote': [
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>',
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>'
    ]
}

df = pd.DataFrame(data)

# Function to parse HTML content
def parse_html(content):
    if '<' in content and '>' in content:
        soup = BeautifulSoup(content, 'html.parser')
        return soup.get_text(separator=" ", strip=True)
    return content

# Function to filter unwanted phrases
def filter_comments(text, unwanted_phrases):
    sentences = text.split('. ')
    filtered = [sentence for sentence in sentences if all(phrase not in sentence for phrase in unwanted_phrases)]
    return '. '.join(filtered)

# Function to prioritize important phrases
def prioritize_comments(text, keywords):
    sentences = text.split('. ')
    prioritized = [sentence for sentence in sentences if any(keyword in sentence for keyword in keywords)]
    remaining = [sentence for sentence in sentences if all(keyword not in sentence for keyword in keywords)]
    return '. '.join(prioritized + remaining)

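# Function to remove duplicate lines (optional helper; not called in the loop below)
def remove_duplicates(text):
    sentences = text.split('\n')
    # dict.fromkeys drops repeats while keeping first-seen order
    unique_sentences = list(dict.fromkeys(sentences))
    return '. '.join(unique_sentences)
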
# Function to summarize text using Sumy
def summarize_text(text, num_sentences=3):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    summary = summarizer(parser.document, num_sentences)
    return ' '.join(str(sentence) for sentence in summary)

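# Sentences containing any of these phrases are dropped before summarizing
unwanted_phrases = ["Integrator Record Response", "Function Name: UPDATE_SYSTEM"]
# Sentences containing any of these keywords are moved to the front
keywords = ["Automation Failed"]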

# Group by RTSK Number (a scalar key, not a one-item list, so each group name is a plain string rather than a tuple)
grouped = df.groupby('RTSK Number')

summaries = []
for name, group in grouped:
    combined_notes = ' '.join(group['RTSK Worknote'].apply(parse_html).tolist())
    filtered_notes = filter_comments(combined_notes, unwanted_phrases)
    prioritized_notes = prioritize_comments(filtered_notes, keywords)
    summary = summarize_text(prioritized_notes, num_sentences=5)
    summaries.append({
        'RTSK Number': name,
        'RTSK Descr': group['RTSK Descr'].iloc[0],
        'Summary': summary,
        'All Comments': '\n'.join(group['RTSK Worknote'])
    })

summary_df = pd.DataFrame(summaries)
print(summary_df)

# Save the summary to a CSV file
summary_df.to_csv('summarized_worknotes.csv', index=False)
print("Summarization complete. Check the summarized_worknotes.csv file.")
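
The remove_duplicates helper is defined in the script but never called in the grouping loop. As a minimal sketch of how deduplication might fit in, the variant below works on the list of notes for one task before they are joined. The name dedupe_notes and the sample notes are hypothetical, and it only removes exact repeats, which is an assumption about what counts as a duplicate.

```python
def dedupe_notes(notes):
    """Return the notes with exact repeats dropped, keeping first-seen order."""
    # dict.fromkeys preserves insertion order (Python 3.7+), so the first copy wins.
    return list(dict.fromkeys(notes))


# Hypothetical worknotes for one task; the repeated automation comment is invented.
task_notes = [
    "Comment from automation",
    "Comment from manually written",
    "Comment from automation",
    "This is HTML content.",
]

unique_notes = dedupe_notes(task_notes)
combined_notes = " ".join(unique_notes)
print(unique_notes)
print(combined_notes)
```

In the loop, this step would sit between parse_html and the ' '.join call, so repeated automation comments are not summarized twice.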

Explanation:

HTML Parsing: The parse_html function uses BeautifulSoup to extract text from HTML content.
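Filter and Prioritize: The filter_comments function drops every sentence that contains an unwanted phrase, and the prioritize_comments function moves sentences that contain a keyword to the front. Both split the text on '. ', so a note that does not end in a period is treated as part of the sentence that follows it.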
Summarization: The summarize_text function uses Sumy's LSA (Latent Semantic Analysis) summarizer to create summaries.
Output: The script groups worknotes by RTSK number and produces one row per task containing the description, the summary, and all original comments joined with newlines. The result is printed and saved to summarized_worknotes.csv.
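
To make the filter and prioritize steps concrete, here is a small standalone sketch that repeats the two helper functions and runs them on a made-up worknote. The sample text is hypothetical; the phrase lists are the ones used in the script.

```python
def filter_comments(text, unwanted_phrases):
    """Drop every sentence that contains any unwanted phrase."""
    sentences = text.split('. ')
    kept = [s for s in sentences if all(p not in s for p in unwanted_phrases)]
    return '. '.join(kept)


def prioritize_comments(text, keywords):
    """Move sentences containing a keyword ahead of the rest."""
    sentences = text.split('. ')
    hits = [s for s in sentences if any(k in s for k in keywords)]
    rest = [s for s in sentences if all(k not in s for k in keywords)]
    return '. '.join(hits + rest)


# Hypothetical worknote text, invented for this illustration.
notes = (
    "Server build started. "
    "Function Name: UPDATE_SYSTEM returned OK. "
    "Automation Failed on disk check. "
    "Engineer restarted the job."
)

filtered = filter_comments(
    notes, ["Integrator Record Response", "Function Name: UPDATE_SYSTEM"]
)
prioritized = prioritize_comments(filtered, ["Automation Failed"])

print(filtered)
# Server build started. Automation Failed on disk check. Engineer restarted the job.
print(prioritized)
# Automation Failed on disk check. Server build started. Engineer restarted the job.
```

Matching is a plain case-sensitive substring test, so "automation failed" in lower case would not be treated as a keyword hit.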

Together, these steps handle the three comment formats and produce one structured summary per task. To adapt the script to your data, adjust unwanted_phrases, keywords, and num_sentences.
