Summertime Summarization
We'll follow these steps to achieve your goals:
Load and Clean Data: Handle different comment formats, including parsing HTML content.
Filter and Prioritize: Remove unwanted phrases and prioritize important phrases.
Summarize Comments: Group by RTSK number and create summaries for each group.
Structure Output: Ensure all comments for each RTSK number are grouped together on separate lines.
Here's a detailed Python script using pandas, BeautifulSoup for HTML parsing, and transformers for summarization:
Step-by-Step Script
Install Necessary Libraries:
```sh
pip install pandas
pip install beautifulsoup4
pip install transformers
pip install torch  # the transformers pipeline needs a backend such as PyTorch
```
Python Code:
```python
import pandas as pd
from bs4 import BeautifulSoup
from transformers import pipeline

# Sample data
data = {
    'RTSK Number': ['TASK0118575574', 'TASK0118575574', 'TASK0118575574',
                    'TASK0118575585', 'TASK0118575585', 'TASK0118575585'],
    'RTSK Descr': ['Capacity Reservation', 'Capacity Reservation', 'Capacity Reservation',
                   'Add Monitoring - Unix', 'Add Monitoring - Unix', 'Add Monitoring - Unix'],
    'RTSK Worknote': [
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>',
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>'
    ]
}
df = pd.DataFrame(data)

# Function to parse HTML content
def parse_html(content):
    # Only run the parser when the note looks like it contains markup
    if '<' in content and '>' in content:
        soup = BeautifulSoup(content, 'html.parser')
        return soup.get_text(separator=" ", strip=True)
    return content

# Function to filter unwanted phrases: drops every sentence containing one
def filter_comments(text, unwanted_phrases):
    sentences = text.split('. ')
    filtered = [sentence for sentence in sentences
                if all(phrase not in sentence for phrase in unwanted_phrases)]
    return '. '.join(filtered)

# Function to prioritize important phrases: keyword sentences come first
def prioritize_comments(text, keywords):
    sentences = text.split('. ')
    prioritized = [sentence for sentence in sentences
                   if any(keyword in sentence for keyword in keywords)]
    remaining = [sentence for sentence in sentences
                 if all(keyword not in sentence for keyword in keywords)]
    return '. '.join(prioritized + remaining)

# Initialize summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# List of unwanted phrases and priority keywords
unwanted_phrases = ["Integrator Record Response", "Function Name: UPDATE_SYSTEM"]
keywords = ["Automation Failed"]

# Group by RTSK Number. Passing a string key (not a one-element list)
# makes `name` the task number itself rather than a 1-tuple.
grouped = df.groupby('RTSK Number')

summaries = []
for name, group in grouped:
    combined_notes = ' '.join(group['RTSK Worknote'].apply(parse_html).tolist())
    filtered_notes = filter_comments(combined_notes, unwanted_phrases)
    prioritized_notes = prioritize_comments(filtered_notes, keywords)
    summary = summarizer(prioritized_notes, max_length=130, min_length=30,
                         do_sample=False)[0]['summary_text']
    summaries.append({
        'RTSK Number': name,
        'RTSK Descr': group['RTSK Descr'].iloc[0],
        'Summary': summary,
        'All Comments': '\n'.join(group['RTSK Worknote'])
    })

summary_df = pd.DataFrame(summaries)
print(summary_df)

# Save the summary to a CSV file
summary_df.to_csv('summarized_worknotes.csv', index=False)
print("Summarization complete. Check the summarized_worknotes.csv file.")
```
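One caveat for the script above: it joins every worknote for a task into a single string, and facebook/bart-large-cnn accepts roughly 1,024 tokens of input, so tasks with many long worknotes may be truncated or fail. A minimal sketch of one workaround, assuming a word count is an acceptable stand-in for the token count: split the text into chunks, summarize each chunk, and join the partial summaries. The helper name chunk_words and the 400-word default are illustrative choices, not part of the original script.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [' '.join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


# Example with a small limit so the chunking is easy to see
sample = "one two three four five six seven"
print(chunk_words(sample, max_words=3))
# → ['one two three', 'four five six', 'seven']
```

Inside the loop, each chunk could then be passed to the summarizer separately and the resulting summary_text values joined with a space.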
Explanation:
Parse HTML Content: The parse_html function uses BeautifulSoup to extract the text from worknotes that contain HTML; other worknotes pass through unchanged.
Filter and Prioritize: The filter_comments function drops every sentence that contains an unwanted phrase, and the prioritize_comments function moves sentences that contain a keyword to the front.
Summarize: Using the transformers library's BART model (facebook/bart-large-cnn), the script summarizes the combined comments of each group.
Output: The script produces one row per RTSK number holding the summary and all of that task's comments, with each comment on its own line.
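To see what the filter and prioritize steps do to a piece of text, here is a small standalone run of the same two functions. The worknote text is made up for illustration; only the phrase and keyword values come from the script.

```python
def filter_comments(text, unwanted_phrases):
    """Drop every sentence that contains any unwanted phrase."""
    sentences = text.split('. ')
    kept = [s for s in sentences
            if all(phrase not in s for phrase in unwanted_phrases)]
    return '. '.join(kept)


def prioritize_comments(text, keywords):
    """Move sentences that contain a keyword ahead of the others."""
    sentences = text.split('. ')
    first = [s for s in sentences if any(k in s for k in keywords)]
    rest = [s for s in sentences if all(k not in s for k in keywords)]
    return '. '.join(first + rest)


# Made-up worknote text for illustration
notes = ("Server provisioned. Function Name: UPDATE_SYSTEM called. "
         "Automation Failed on step 3. Ticket reassigned")

filtered = filter_comments(notes, ["Function Name: UPDATE_SYSTEM"])
result = prioritize_comments(filtered, ["Automation Failed"])
print(result)
# → Automation Failed on step 3. Server provisioned. Ticket reassigned
```

Note that the matching is case-sensitive and sentences are split only on a period followed by a space, so worknotes that use other punctuation or line breaks are treated as one long sentence.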
This solution integrates everything you need to handle various comment formats and produce structured summaries. If you have any more specific requirements or need further customization, let me know!
Got it! If you need a summarization model that can be used offline, you can consider using SummerTime, an open-source text summarization toolkit. It supports various models like TextRank, BART, Longformer, and more. Here's how you can set it up:
Step-by-Step Script
Install SummerTime:
```sh
pip install summertime
```
Python Code:
```python
import pandas as pd
from bs4 import BeautifulSoup
# Import path and calls follow the usage shown in the SummerTime README;
# verify them against the version you have installed.
from summertime import model

# Sample data
data = {
    'RTSK Number': ['TASK0118575574', 'TASK0118575574', 'TASK0118575574',
                    'TASK0118575585', 'TASK0118575585', 'TASK0118575585'],
    'RTSK Descr': ['Capacity Reservation', 'Capacity Reservation', 'Capacity Reservation',
                   'Add Monitoring - Unix', 'Add Monitoring - Unix', 'Add Monitoring - Unix'],
    'RTSK Worknote': [
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>',
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>'
    ]
}
df = pd.DataFrame(data)

# Function to parse HTML content
def parse_html(content):
    if '<' in content and '>' in content:
        soup = BeautifulSoup(content, 'html.parser')
        return soup.get_text(separator=" ", strip=True)
    return content

# Initialize summarization model (pretrained weights are fetched on first use)
summarizer = model.BartModel()

# Group by RTSK Number; a string key makes `name` the task number itself
grouped = df.groupby('RTSK Number')

summaries = []
for name, group in grouped:
    combined_notes = ' '.join(group['RTSK Worknote'].apply(parse_html).tolist())
    # Collapse repeated whitespace before summarizing
    cleaned_notes = ' '.join(combined_notes.split())
    # summarize() takes a list of documents and returns one summary per document
    summary = summarizer.summarize([cleaned_notes])[0]
    summaries.append({
        'RTSK Number': name,
        'RTSK Descr': group['RTSK Descr'].iloc[0],
        'Summary': summary,
        'All Comments': '\n'.join(group['RTSK Worknote'])
    })

summary_df = pd.DataFrame(summaries)
print(summary_df)

# Save the summary to a CSV file
summary_df.to_csv('summarized_worknotes.csv', index=False)
print("Summarization complete. Check the summarized_worknotes.csv file.")
```
Explanation:
SummerTime: This library wraps several summarization models behind one interface. Neural models such as BART need their pretrained weights, so download them once before moving to an offline environment.
HTML Parsing: The parse_html function extracts text from HTML content.
Summarization: The BartModel from SummerTime is used to generate summaries.
Output: The script groups comments by RTSK number and provides structured summaries.
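If BeautifulSoup cannot be installed in a restricted environment, the HTML step can be approximated with Python's standard library alone. The sketch below is an assumption about what such a fallback could look like, not part of the script above; the names TextExtractor and strip_html are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def strip_html(content):
    """Return the text of content with tags removed, joined by single spaces."""
    extractor = TextExtractor()
    extractor.feed(content)
    extractor.close()  # flush any text still buffered by the parser
    return ' '.join(extractor.parts)


print(strip_html('<h2>html content need parser</h2><p>This is HTML content.</p>'))
# → html content need parser This is HTML content.
```

Plain-text worknotes pass through unchanged, so strip_html could stand in for parse_html in the grouping loop.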
This approach should work well within your organization's restrictions. If you need further customization or have any questions, feel free to ask!
Both SummerTime and Sumy are excellent choices for text summarization, but they have different strengths:
SummerTime:
Model Variety: Supports a wide range of models, including BART, Longformer, and TextRank.
Ease of Use: Designed for non-experts with easy-to-use APIs and visualizations.
Flexibility: Can handle various summarization tasks like query-based, multi-document, and dialogue summarization.
Offline Use: Can run without internet access once the pretrained model weights have been downloaded.
Sumy:
Simplicity: Easier to set up and use for basic summarization tasks.
Lightweight: Smaller library with fewer dependencies.
Performance: Its extractive algorithms (such as LSA, LexRank, and Luhn) run quickly on a CPU, which suits quick and straightforward summarization needs.
Which to Choose?
If you need a more comprehensive toolkit with a variety of models and tasks, SummerTime might be the better choice.
If you prefer a simpler, lightweight solution for basic summarization, Sumy could be more suitable.
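To make "basic summarization" concrete, here is a minimal frequency-based extractive summarizer in plain Python. It only illustrates the extractive idea (score sentences, keep the best ones) and is not the implementation used by Sumy or SummerTime; the function name extractive_summary and the scoring rule are assumptions chosen for the sketch.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=2):
    """Return the num_sentences highest-scoring sentences in original order."""
    # Split after sentence-ending punctuation followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Word frequencies over the whole text (no stop-word removal in this sketch)
    freq = Counter(re.findall(r'[a-z0-9]+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'[a-z0-9]+', sentence.lower())
        # Average word frequency, so long sentences are not favored by length
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return ' '.join(sentences[i] for i in chosen)


# Made-up worknotes for illustration
notes = ("Disk alert raised on host. Disk alert cleared after cleanup. "
         "Lunch was good.")
print(extractive_summary(notes, num_sentences=2))
```

Because it needs no model download, a scorer like this runs fully offline, at the cost of summaries that are just selected sentences rather than rewritten text.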