Sumy New Summarization

Sure! Let's use the Sumy library to summarize your worknotes. The code below handles HTML content, manually written comments, and comments from automated tasks: it parses the HTML, filters out sentences that contain unwanted phrases, moves sentences with important keywords to the front, and then summarizes the worknotes grouped by task number.

Step-by-Step Script

  1. Install Necessary Libraries:

    sh
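    pip install pandas
    pip install beautifulsoup4
    pip install sumy
    # Sumy's English tokenizer relies on NLTK's "punkt" data; download it once
    # (depending on your NLTK version, it may ask for 'punkt_tab' instead)
    python -c "import nltk; nltk.download('punkt')"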
    
  2. Python Code:

    python
    import pandas as pd
    from bs4 import BeautifulSoup
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.lsa import LsaSummarizer
    
    # Sample Data
    data = {
        'RTSK Number': ['TASK0118575574', 'TASK0118575574', 'TASK0118575574', 'TASK0118575585', 'TASK0118575585', 'TASK0118575585'],
        'RTSK Descr': ['Capacity Reservation', 'Capacity Reservation', 'Capacity Reservation', 'Add Monitoring - Unix', 'Add Monitoring - Unix', 'Add Monitoring - Unix'],
        'RTSK Worknote': [
            'Comment from automation',
            'Comment from manually written',
            '<h2>html content need parser</h2><p>This is HTML content.</p>',
            'Comment from automation',
            'Comment from manually written',
            '<h2>html content need parser</h2><p>This is HTML content.</p>'
        ]
    }
    
    df = pd.DataFrame(data)
    
    # Function to parse HTML content
    def parse_html(content):
        if '<' in content and '>' in content:
            soup = BeautifulSoup(content, 'html.parser')
            return soup.get_text(separator=" ", strip=True)
        return content
    
    # Function to filter unwanted phrases
    def filter_comments(text, unwanted_phrases):
        sentences = text.split('. ')
        filtered = [sentence for sentence in sentences if all(phrase not in sentence for phrase in unwanted_phrases)]
        return '. '.join(filtered)
    
    # Function to prioritize important phrases
    def prioritize_comments(text, keywords):
        sentences = text.split('. ')
        prioritized = [sentence for sentence in sentences if any(keyword in sentence for keyword in keywords)]
        remaining = [sentence for sentence in sentences if all(keyword not in sentence for keyword in keywords)]
        return '. '.join(prioritized + remaining)
    
    # Function to remove duplicate lines (keeps the first occurrence, preserves order)
    # Helper only: the loop below does not call it
    def remove_duplicates(text):
        sentences = text.split('\n')
        unique_sentences = list(dict.fromkeys(sentences))
        return '. '.join(unique_sentences)
    
    # Function to summarize text using Sumy
    def summarize_text(text, num_sentences=3):
        parser = PlaintextParser.from_string(text, Tokenizer("english"))
        summarizer = LsaSummarizer()
        summary = summarizer(parser.document, num_sentences)
        return ' '.join(str(sentence) for sentence in summary)
    
    # List of unwanted phrases
    unwanted_phrases = ["Integrator Record Response", "Function Name: UPDATE_SYSTEM"]
    # Keywords whose sentences are moved to the front
    keywords = ["Automation Failed"]
    
    # Group by RTSK Number (a plain column name, so each group key is a string, not a 1-tuple)
    grouped = df.groupby('RTSK Number')
    
    summaries = []
    for name, group in grouped:
        combined_notes = ' '.join(group['RTSK Worknote'].apply(parse_html).tolist())
        filtered_notes = filter_comments(combined_notes, unwanted_phrases)
        prioritized_notes = prioritize_comments(filtered_notes, keywords)
        summary = summarize_text(prioritized_notes, num_sentences=5)
        summaries.append({
            'RTSK Number': name,
            'RTSK Descr': group['RTSK Descr'].iloc[0],
            'Summary': summary,
            'All Comments': '\n'.join(group['RTSK Worknote'])
        })
    
    summary_df = pd.DataFrame(summaries)
    print(summary_df)
    
    # Save the summary to a CSV file
    summary_df.to_csv('summarized_worknotes.csv', index=False)
    print("Summarization complete. Check the summarized_worknotes.csv file.")
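To check the HTML handling on its own, without installing anything, here is a minimal sketch. It is not the script's approach: it swaps BeautifulSoup for the standard library's html.parser, and the names TextExtractor and parse_html_stdlib are hypothetical. It keeps the same "looks like HTML" check as parse_html and joins the text pieces with a single space.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty pieces, stripped of surrounding whitespace
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def parse_html_stdlib(content):
    """Stdlib stand-in for parse_html: strip tags, join text with spaces."""
    if '<' in content and '>' in content:
        extractor = TextExtractor()
        extractor.feed(content)
        extractor.close()
        return ' '.join(extractor.chunks)
    # Plain-text worknotes pass through unchanged
    return content


html_note = '<h2>html content need parser</h2><p>This is HTML content.</p>'
print(parse_html_stdlib(html_note))
print(parse_html_stdlib('Comment from automation'))
```

For the sample worknote, the heading and paragraph come back as one line of text, while the plain comment is returned as it was. Note that the heading has no trailing period, so it merges with the following sentence once the notes are split on '. '.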

Explanation:

  1. HTML Parsing: The parse_html function uses BeautifulSoup to extract text from HTML content.

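  2. Filter and Prioritize: The filter_comments function drops every sentence that contains an unwanted phrase, and the prioritize_comments function moves sentences containing a keyword to the front. Both split the text on '. ', so they only act on individual sentences when the worknotes use that delimiter.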

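  3. Summarization: The summarize_text function uses Sumy's LSA (Latent Semantic Analysis) summarizer to select up to num_sentences sentences (5 in the loop above). The summary is extractive: sentences are copied from the worknotes, not rewritten.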

  4. Output: The script groups worknotes by RTSK Number and builds one row per task with its description, its summary, and all of its original worknotes joined on separate lines in the All Comments column. The result is printed and saved to summarized_worknotes.csv.
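The filter and prioritize steps are easy to try in isolation. The sketch below reuses the two functions from the script with the same phrase lists; the notes string is made-up sample text, not real worknote data, and it uses '. ' between sentences so the split has something to work with.

```python
def filter_comments(text, unwanted_phrases):
    # Drop every sentence that contains any unwanted phrase
    sentences = text.split('. ')
    filtered = [s for s in sentences if all(p not in s for p in unwanted_phrases)]
    return '. '.join(filtered)


def prioritize_comments(text, keywords):
    # Sentences with a keyword first, then the rest in their original order
    sentences = text.split('. ')
    prioritized = [s for s in sentences if any(k in s for k in keywords)]
    remaining = [s for s in sentences if all(k not in s for k in keywords)]
    return '. '.join(prioritized + remaining)


# Hypothetical sample text, for illustration only
notes = ("Server provisioned. Integrator Record Response received. "
         "Automation Failed on step 3. Retried manually")
unwanted_phrases = ["Integrator Record Response", "Function Name: UPDATE_SYSTEM"]
keywords = ["Automation Failed"]

filtered = filter_comments(notes, unwanted_phrases)
prioritized = prioritize_comments(filtered, keywords)

print(filtered)
# Server provisioned. Automation Failed on step 3. Retried manually
print(prioritized)
# Automation Failed on step 3. Server provisioned. Retried manually
```

Matching is by exact, case-sensitive substring, so "automation failed" in lowercase would not be prioritized. If the worknotes vary in case, lowercasing both sides before comparing is one option.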

This solution integrates everything needed to handle various comment formats and produce structured summaries. If you need more specific customizations or assistance, let me know!
