Sumy New Summarization
Sure! Let's use the Sumy library to summarize your worknotes. Below is the code to handle HTML content, manually written comments, and comments from automated tasks. This code will parse the HTML, filter out unwanted phrases, prioritize important phrases, and then summarize the grouped worknotes.
Step-by-Step Script
Install Necessary Libraries:
```sh
pip install pandas
pip install beautifulsoup4
pip install sumy
```
Python Code:
```python
import pandas as pd
from bs4 import BeautifulSoup
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

# Sample Data
data = {
    'RTSK Number': ['TASK0118575574', 'TASK0118575574', 'TASK0118575574',
                    'TASK0118575585', 'TASK0118575585', 'TASK0118575585'],
    'RTSK Descr': ['Capacity Reservation', 'Capacity Reservation', 'Capacity Reservation',
                   'Add Monitoring - Unix', 'Add Monitoring - Unix', 'Add Monitoring - Unix'],
    'RTSK Worknote': [
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>',
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>'
    ]
}
df = pd.DataFrame(data)


# Function to parse HTML content (plain-text notes are returned unchanged)
def parse_html(content):
    if '<' in content and '>' in content:
        soup = BeautifulSoup(content, 'html.parser')
        return soup.get_text(separator=" ", strip=True)
    return content


# Function to filter unwanted phrases: drops every sentence containing one
def filter_comments(text, unwanted_phrases):
    sentences = text.split('. ')
    filtered = [sentence for sentence in sentences
                if all(phrase not in sentence for phrase in unwanted_phrases)]
    return '. '.join(filtered)


# Function to prioritize important phrases: keyword sentences move to the front
def prioritize_comments(text, keywords):
    sentences = text.split('. ')
    prioritized = [sentence for sentence in sentences
                   if any(keyword in sentence for keyword in keywords)]
    remaining = [sentence for sentence in sentences
                 if all(keyword not in sentence for keyword in keywords)]
    return '. '.join(prioritized + remaining)


# Function to remove duplicate lines (order-preserving; not called in the loop below)
def remove_duplicates(text):
    sentences = text.split('\n')
    unique_sentences = list(dict.fromkeys(sentences))
    return '. '.join(unique_sentences)


# Function to summarize text using Sumy
# Note: Tokenizer("english") relies on NLTK tokenizer data (e.g. 'punkt'),
# which may need a one-time download via nltk.download().
def summarize_text(text, num_sentences=3):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    summary = summarizer(parser.document, num_sentences)
    return ' '.join(str(sentence) for sentence in summary)


# List of unwanted phrases and priority keywords
unwanted_phrases = ["Integrator Record Response", "Function Name: UPDATE_SYSTEM"]
keywords = ["Automation Failed"]

# Group by RTSK Number. A plain string key keeps each group name a string;
# a one-element list key yields 1-tuples in recent pandas versions.
grouped = df.groupby('RTSK Number')

summaries = []
for name, group in grouped:
    combined_notes = ' '.join(group['RTSK Worknote'].apply(parse_html).tolist())
    filtered_notes = filter_comments(combined_notes, unwanted_phrases)
    prioritized_notes = prioritize_comments(filtered_notes, keywords)
    summary = summarize_text(prioritized_notes, num_sentences=5)
    summaries.append({
        'RTSK Number': name,
        'RTSK Descr': group['RTSK Descr'].iloc[0],
        'Summary': summary,
        'All Comments': '\n'.join(group['RTSK Worknote'])
    })

summary_df = pd.DataFrame(summaries)
print(summary_df)

# Save the summary to a CSV file
summary_df.to_csv('summarized_worknotes.csv', index=False)
print("Summarization complete. Check the summarized_worknotes.csv file.")
```
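One thing to watch in the script above: remove_duplicates is defined but never called, and the notes in each group are joined with a single space, so notes that don't end in a period run together into one long "sentence" when the text is later split on '. '. Here's a minimal sketch of one way to handle both before filtering. The helper name combine_notes and the sample notes are my own assumptions, not part of the original script.

```python
def combine_notes(notes):
    """Hypothetical helper: drop duplicate notes, then join them as separate sentences."""
    # dict.fromkeys keeps the first occurrence of each note and preserves order
    unique = list(dict.fromkeys(n.strip() for n in notes if n.strip()))
    # Make sure every note ends with sentence punctuation so '. ' splitting works
    terminated = [n if n[-1] in ".!?" else n + "." for n in unique]
    return " ".join(terminated)


# Illustrative notes (made up for this example)
notes = [
    "Comment from automation",
    "Comment from automation",
    "Automation Failed on step 2",
    "Restarted the agent manually.",
]

combined = combine_notes(notes)
print(combined)
print(len(combined.split(". ")), "sentences after combining")
print(len(" ".join(notes).split(". ")), "sentence with a plain space join")
```

If this fits your data, you could build combined_notes in the loop with combine_notes(group['RTSK Worknote'].apply(parse_html)) instead of the plain ' '.join(...).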
Explanation:
HTML Parsing: The parse_html function uses BeautifulSoup to extract text from HTML content.
Filter and Prioritize: The filter_comments function drops sentences that contain unwanted phrases, and the prioritize_comments function moves sentences that contain important keywords to the top.
Summarization: The summarize_text function uses Sumy's LSA (Latent Semantic Analysis) summarizer to create summaries.
Output: The script groups comments by RTSK number and produces one summary row per number, with all of that number's original comments kept together, one per line, in the All Comments column.
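To make the filter and prioritize steps concrete, here's a small standalone sketch that runs the same split-on-'. ' logic on one made-up worknote. The sample text is invented for illustration; the phrase and keyword lists are the ones from the script.

```python
def filter_comments(text, unwanted_phrases):
    # Keep only sentences that contain none of the unwanted phrases
    sentences = text.split('. ')
    kept = [s for s in sentences if all(p not in s for p in unwanted_phrases)]
    return '. '.join(kept)


def prioritize_comments(text, keywords):
    # Sentences with a keyword come first; the rest keep their original order
    sentences = text.split('. ')
    first = [s for s in sentences if any(k in s for k in keywords)]
    rest = [s for s in sentences if all(k not in s for k in keywords)]
    return '. '.join(first + rest)


# Made-up worknote text for illustration
sample = (
    "Integrator Record Response: 200 OK. "
    "Disk added to server. "
    "Automation Failed at provisioning step. "
    "Ticket reassigned to storage team."
)

filtered = filter_comments(sample, ["Integrator Record Response", "Function Name: UPDATE_SYSTEM"])
prioritized = prioritize_comments(filtered, ["Automation Failed"])

print(filtered)
print(prioritized)
```

Note that the matching is case-sensitive substring matching, so "automation failed" in lowercase would not be prioritized unless you normalize the case first.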
This solution integrates everything needed to handle various comment formats and produce structured summaries. If you need more specific customizations or assistance, let me know!