Summarization Sumy

 

The updated script makes two changes:

  1. Filtering out unwanted phrases before summarization.
  2. Prioritizing specific comments containing phrases like "Automation Failed" or "Automation Success".

Here’s the updated code with comments for each change:

```python
import pandas as pd
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

# Define unwanted phrases and important keywords
unwanted_phrases = ["<h2>Parameters:</h3><br/>", "<h2>Logs:</h2><br/>"]  # Add more unwanted phrases as needed
priority_phrases = ["Automation Failed", "Automation Success"]


# Function to filter out unwanted phrases
def filter_comments(text, unwanted_phrases):
    for phrase in unwanted_phrases:
        text = text.replace(phrase, "")
    return text


# Function to prioritize specific comments by bringing important phrases to the top
def prioritize_comments(text):
    # Drop trailing periods so reordered sentences rejoin without ".."
    pieces = (piece.strip().rstrip('.') for piece in text.split('. '))
    lines = [line for line in pieces if line]
    prioritized = [line for line in lines if any(phrase in line for phrase in priority_phrases)]
    non_prioritized = [line for line in lines if line not in prioritized]
    result = '. '.join(prioritized + non_prioritized)
    # Restore the final period unless the text is empty or already punctuated
    if result and result[-1] not in '!?':
        result += '.'
    return result


# Sumy summarization function
# Tokenizer("english") needs NLTK's punkt tokenizer data to be downloaded first
def summarize_with_sumy(text, num_sentences=2):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    summary = summarizer(parser.document, num_sentences)
    return ' '.join(str(sentence) for sentence in summary)


# Load your dataset
df = pd.read_csv('your_data.csv')  # Set the path to your dataset

# Group by 'RTSK Short Desc' and 'Automation Execution Status'
grouped = df.groupby(['RTSK Short Desc', 'Automation Execution Status'])

summaries = []
for name, group in grouped:
    # Skip missing worknotes and force strings so the join cannot fail on NaN
    notes = group['RTSK Worknote'].dropna().astype(str).tolist()
    combined_notes = '. '.join(notes)
    filtered_notes = filter_comments(combined_notes, unwanted_phrases)
    summary = summarize_with_sumy(filtered_notes, num_sentences=5)
    customized_summary = prioritize_comments(summary)
    summaries.append({
        'RTSK Short Desc': name[0],
        'Automation Execution Status': name[1],
        'Summary': customized_summary
    })

# Convert summaries to DataFrame and save to CSV
summary_df = pd.DataFrame(summaries)
summary_df.to_csv('summarized_worknotes.csv', index=False)
print("Summarization complete. Check the summarized_worknotes.csv file.")
```

Explanation of Key Functions

  1. filter_comments: Removes each unwanted phrase from the combined RTSK Worknote text of a group before it is summarized.
  2. prioritize_comments: Finds sentences that contain a priority phrase and moves them to the top of the final summary.
  3. summarize_with_sumy: Uses Sumy's LSA summarizer to condense the filtered worknotes for each group into at most num_sentences sentences.
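To see what the two helper functions do without installing Sumy, here is a small standalone sketch. The worknote text is made up for illustration, and prioritize_comments is restated with trailing periods normalised so that moving the last sentence to the front does not leave doubled punctuation.

```python
unwanted_phrases = ["<h2>Parameters:</h3><br/>", "<h2>Logs:</h2><br/>"]
priority_phrases = ["Automation Failed", "Automation Success"]


def filter_comments(text, unwanted_phrases):
    """Remove every unwanted phrase from the text."""
    for phrase in unwanted_phrases:
        text = text.replace(phrase, "")
    return text


def prioritize_comments(text):
    """Move sentences containing a priority phrase to the front."""
    # Drop trailing periods so reordered sentences rejoin without ".."
    pieces = (piece.strip().rstrip('.') for piece in text.split('. '))
    lines = [line for line in pieces if line]
    prioritized = [line for line in lines if any(p in line for p in priority_phrases)]
    non_prioritized = [line for line in lines if line not in prioritized]
    result = '. '.join(prioritized + non_prioritized)
    # Restore the final period unless the text is empty or already punctuated
    if result and result[-1] not in '!?':
        result += '.'
    return result


# Made-up worknote text, for illustration only
note = "<h2>Logs:</h2><br/>Disk check started. Ticket reassigned. Automation Failed on step 3."

filtered = filter_comments(note, unwanted_phrases)
reordered = prioritize_comments(filtered)

print(filtered)   # Disk check started. Ticket reassigned. Automation Failed on step 3.
print(reordered)  # Automation Failed on step 3. Disk check started. Ticket reassigned.
```

Note that the split on '. ' is a simple heuristic: text such as "v1. 2" or abbreviations followed by a space would be cut in the wrong place.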

Additional Notes

  • Adjust num_sentences in summarize_with_sumy based on the summary length needed (the function defaults to 2; the script above passes 5).
  • The priority_phrases list can be expanded with more keywords as required.
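When expanding priority_phrases, one option is to match case-insensitively so that a worknote written as "automation failed" is still caught. The sketch below is an assumption about how that could look, not part of the original script: the helper is_priority and the extra phrase "Automation Timeout" are hypothetical.

```python
# "Automation Timeout" is a hypothetical extra keyword, added for illustration
priority_phrases = ["Automation Failed", "Automation Success", "Automation Timeout"]


def is_priority(sentence, phrases=priority_phrases):
    """Return True if the sentence contains any priority phrase, ignoring case."""
    lowered = sentence.lower()
    return any(phrase.lower() in lowered for phrase in phrases)


print(is_priority("automation failed at step 2"))  # True
print(is_priority("Disk check started"))           # False
```

A check like this could replace the `any(phrase in line ...)` test inside prioritize_comments if case differences turn out to matter in your data.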
