Summertime Summarization

 We'll follow these steps to achieve your goals:

  1. Load and Clean Data: Handle different comment formats, including parsing HTML content.

  2. Filter and Prioritize: Remove unwanted phrases and prioritize important phrases.

  3. Summarize Comments: Group by RTSK number and create summaries for each group.

  4. Structure Output: Ensure all comments for each RTSK number are grouped together on separate lines.

Here's a detailed Python script using pandas, BeautifulSoup for HTML parsing, and transformers for summarization:

Step-by-Step Script

  1. Install Necessary Libraries:

    sh
    pip install pandas
    pip install beautifulsoup4
    pip install transformers
    
  2. Python Code:

    python
    import pandas as pd
    from bs4 import BeautifulSoup
    from transformers import pipeline
    
    # Sample Data
    data = {
        'RTSK Number': ['TASK0118575574', 'TASK0118575574', 'TASK0118575574', 'TASK0118575585', 'TASK0118575585', 'TASK0118575585'],
        'RTSK Descr': ['Capacity Reservation', 'Capacity Reservation', 'Capacity Reservation', 'Add Monitoring - Unix', 'Add Monitoring - Unix', 'Add Monitoring - Unix'],
        'RTSK Worknote': [
            'Comment from automation',
            'Comment from manually written',
            '<h2>html content need parser</h2><p>This is HTML content.</p>',
            'Comment from automation',
            'Comment from manually written',
            '<h2>html content need parser</h2><p>This is HTML content.</p>'
        ]
    }
    
    df = pd.DataFrame(data)
    
    # Function to parse HTML content
    def parse_html(content):
        if '<' in content and '>' in content:
            soup = BeautifulSoup(content, 'html.parser')
            return soup.get_text(separator=" ", strip=True)
        return content
    
    # Function to filter unwanted phrases
    def filter_comments(text, unwanted_phrases):
        sentences = text.split('. ')
        filtered = [sentence for sentence in sentences if all(phrase not in sentence for phrase in unwanted_phrases)]
        return '. '.join(filtered)
    
    # Function to prioritize important phrases
    def prioritize_comments(text, keywords):
        sentences = text.split('. ')
        prioritized = [sentence for sentence in sentences if any(keyword in sentence for keyword in keywords)]
        remaining = [sentence for sentence in sentences if all(keyword not in sentence for keyword in keywords)]
        return '. '.join(prioritized + remaining)
    
    # Initialize summarization pipeline
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    
    # List of unwanted phrases
    unwanted_phrases = ["Integrator Record Response", "Function Name: UPDATE_SYSTEM"]
    keywords = ["Automation Failed"]
    
    # Group by RTSK Number
    grouped = df.groupby('RTSK Number')  # a plain string key keeps `name` a string, not a 1-tuple
    
    summaries = []
    for name, group in grouped:
        combined_notes = ' '.join(group['RTSK Worknote'].apply(parse_html).tolist())
        filtered_notes = filter_comments(combined_notes, unwanted_phrases)
        prioritized_notes = prioritize_comments(filtered_notes, keywords)
        summary = summarizer(prioritized_notes, max_length=130, min_length=30, do_sample=False)[0]['summary_text']
        summaries.append({
            'RTSK Number': name,
            'RTSK Descr': group['RTSK Descr'].iloc[0],
            'Summary': summary,
            'All Comments': '\n'.join(group['RTSK Worknote'])
        })
    
    summary_df = pd.DataFrame(summaries)
    print(summary_df)
    
    # Save the summary to a CSV file
    summary_df.to_csv('summarized_worknotes.csv', index=False)
    print("Summarization complete. Check the summarized_worknotes.csv file.")
    

Explanation:

  1. Parse HTML Content: The parse_html function uses BeautifulSoup to extract text from HTML content.

  2. Filter and Prioritize: The filter_comments function drops every sentence that contains an unwanted phrase, and the prioritize_comments function moves sentences that contain a keyword to the front of the text.

  3. Summarize: Using the transformers library's BART model, the script summarizes the grouped comments.

  4. Output: The script writes one row per RTSK number with its summary, and the All Comments column holds that task's original worknotes, one per line.
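
To illustrate the filter and prioritize steps in isolation, here is a minimal sketch that runs the same two functions on a short worknote string. The sample text is invented for illustration; only the function logic comes from the script above.

```python
def filter_comments(text, unwanted_phrases):
    # Drop every sentence that contains any unwanted phrase.
    sentences = text.split('. ')
    kept = [s for s in sentences if all(p not in s for p in unwanted_phrases)]
    return '. '.join(kept)


def prioritize_comments(text, keywords):
    # Move sentences that mention a keyword ahead of the rest,
    # keeping the relative order inside each group.
    sentences = text.split('. ')
    first = [s for s in sentences if any(k in s for k in keywords)]
    rest = [s for s in sentences if all(k not in s for k in keywords)]
    return '. '.join(first + rest)


# Hypothetical worknote text, invented for this example.
notes = ("Server rebooted. Function Name: UPDATE_SYSTEM ran. "
         "Automation Failed on step 3. Ticket closed")

filtered = filter_comments(notes, ["Function Name: UPDATE_SYSTEM"])
ordered = prioritize_comments(filtered, ["Automation Failed"])
print(ordered)
```

Because both functions split on '. ', a worknote without sentence-ending periods is treated as one sentence and is kept or dropped as a whole.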

This solution integrates everything you need to handle various comment formats and produce structured summaries. If you have any more specific requirements or need further customization, let me know!

Transformers needs to connect to Hugging Face over the internet, which is restricted in our organization. Can you suggest another model that I can use without connecting to the internet?

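Got it! If you need summarization that can run offline, you can consider SummerTime, an open-source text summarization toolkit. It supports various models like TextRank, BART, Longformer, and more. Its neural models such as BART still load pretrained weights from Hugging Face, so on a restricted network those weights must already be in the local cache; extractive models such as TextRank do not depend on them. Here's how you can set it up: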

Step-by-Step Script

  1. Install SummerTime:

    sh
    pip install summertime
    
  2. Python Code:

    python
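    import pandas as pd
    from bs4 import BeautifulSoup
    # SummerTime exposes its model classes under summertime.model
    from summertime.model import BartModel, TextRankModel
    
    # Local helper: collapse runs of whitespace before summarizing
    def clean_text(text):
        return ' '.join(text.split())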
    
    # Sample Data
    data = {
        'RTSK Number': ['TASK0118575574', 'TASK0118575574', 'TASK0118575574', 'TASK0118575585', 'TASK0118575585', 'TASK0118575585'],
        'RTSK Descr': ['Capacity Reservation', 'Capacity Reservation', 'Capacity Reservation', 'Add Monitoring - Unix', 'Add Monitoring - Unix', 'Add Monitoring - Unix'],
        'RTSK Worknote': [
            'Comment from automation',
            'Comment from manually written',
            '<h2>html content need parser</h2><p>This is HTML content.</p>',
            'Comment from automation',
            'Comment from manually written',
            '<h2>html content need parser</h2><p>This is HTML content.</p>'
        ]
    }
    
    df = pd.DataFrame(data)
    
    # Function to parse HTML content
    def parse_html(content):
        if '<' in content and '>' in content:
            soup = BeautifulSoup(content, 'html.parser')
            return soup.get_text(separator=" ", strip=True)
        return content
    
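    # Initialize summarization model. BartModel wraps a Hugging Face BART checkpoint,
    # so on an offline machine its weights must already be in the local cache.
    summarizer = BartModel()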
    
    # Group by RTSK Number
    grouped = df.groupby('RTSK Number')  # a plain string key keeps `name` a string, not a 1-tuple
    
    summaries = []
    for name, group in grouped:
        combined_notes = ' '.join(group['RTSK Worknote'].apply(parse_html).tolist())
        cleaned_notes = clean_text(combined_notes)
        # summarize() takes a list of documents and returns a list of summaries
        summary = summarizer.summarize([cleaned_notes])[0]
        summaries.append({
            'RTSK Number': name,
            'RTSK Descr': group['RTSK Descr'].iloc[0],
            'Summary': summary,
            'All Comments': '\n'.join(group['RTSK Worknote'])
        })
    
    summary_df = pd.DataFrame(summaries)
    print(summary_df)
    
    # Save the summary to a CSV file
    summary_df.to_csv('summarized_worknotes.csv', index=False)
    print("Summarization complete. Check the summarized_worknotes.csv file.")
    

Explanation:

  1. SummerTime: This library provides various summarization models. Its extractive models run locally, while BART needs its pretrained weights cached in advance to run offline.

  2. HTML Parsing: The parse_html function extracts text from HTML content.

  3. Summarization: The BartModel from SummerTime is used to generate summaries.

  4. Output: The script groups comments by RTSK number and provides structured summaries.
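
To make the HTML parsing step concrete, here is a minimal sketch that extracts text with only Python's standard-library html.parser, which may help where extra packages are hard to install. TextExtractor and html_to_text are hypothetical names chosen for this example; they are not part of BeautifulSoup or SummerTime, and the result is only an approximation of what parse_html returns.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        text = data.strip()
        if text:
            self.parts.append(text)


def html_to_text(content):
    extractor = TextExtractor()
    extractor.feed(content)
    extractor.close()  # flush any text still buffered by the parser
    return ' '.join(extractor.parts)


worknote = '<h2>html content need parser</h2><p>This is HTML content.</p>'
plain = html_to_text(worknote)
unchanged = html_to_text('Comment from automation')
print(plain)
```

Plain-text worknotes pass through unchanged, so the function can be applied to every row without first checking for angle brackets. This sketch also keeps text inside script or style tags, which a production version would likely skip.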

This approach should work well within your organization's restrictions. If you need further customization or have any questions, feel free to ask!

Is SummerTime a better choice than Sumy?

Both SummerTime and Sumy are excellent choices for text summarization, but they have different strengths:

SummerTime:

  • Model Variety: Supports a wide range of models, including BART, Longformer, and TextRank.

  • Ease of Use: Designed for non-experts with easy-to-use APIs and visualizations.

  • Flexibility: Can handle various summarization tasks like query-based, multi-document, and dialogue summarization.

  • Offline Use: Extractive models such as TextRank run offline once installed; neural models such as BART need their pretrained weights downloaded once beforehand.

Sumy:

  • Simplicity: Easier to set up and use for basic summarization tasks.

  • Lightweight: Smaller library with fewer dependencies.

  • Performance: Its extractive algorithms (such as LSA, LexRank, TextRank, and Luhn) run quickly on CPU and need no pretrained model weights.

Which to Choose?

  • If you need a more comprehensive toolkit with a variety of models and tasks, SummerTime might be the better choice.

  • If you prefer a simpler, lightweight solution for basic summarization, Sumy could be more suitable.
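
To show what an extractive approach does without any downloaded model, here is a minimal word-frequency sketch in plain Python. It is not the algorithm used by Sumy or SummerTime; extractive_summary, its scoring rule, and the sample notes are assumptions made for illustration only.

```python
import re
from collections import Counter


def _tokens(text):
    # Lowercase alphanumeric words, ignoring very short ones such as "the".
    return [w for w in re.findall(r'[a-z0-9]+', text.lower()) if len(w) > 3]


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences in their original order."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if len(sentences) <= num_sentences:
        return ' '.join(sentences)

    freq = Counter(_tokens(text))

    def score(sentence):
        # Average document-level frequency of the sentence's words.
        words = _tokens(sentence)
        return sum(freq[w] for w in words) / (len(words) or 1)

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return ' '.join(sentences[i] for i in chosen)


# Hypothetical worknotes, invented for this example.
notes = ("Monitoring agent installed on the Unix server. "
         "Automation failed while updating the monitoring configuration. "
         "The engineer went to lunch. "
         "Monitoring configuration was corrected and automation completed.")

summary = extractive_summary(notes, num_sentences=2)
print(summary)
```

Sentences that share the most frequent words in the notes score highest, so the off-topic sentence is dropped. A dedicated library adds better tokenization, stop-word lists, and graph-based ranking on top of this basic idea.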


---------------------------------------------------------------------------------------------------------------------

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
