Summertime Summarization

We'll follow these steps to achieve your goals:

Load and Clean Data: Handle different comment formats, including parsing HTML content.
Filter and Prioritize: Remove unwanted phrases and prioritize important phrases.
Summarize Comments: Group by RTSK number and create summaries for each group.
Structure Output: Ensure all comments for each RTSK number are grouped together on separate lines.

Here's a detailed Python script using pandas, BeautifulSoup for HTML parsing, and transformers for summarization:

Step-by-Step Script

Install Necessary Libraries:

pip install pandas
pip install beautifulsoup4
pip install transformers
pip install torch

Python Code:

python

import pandas as pd
from bs4 import BeautifulSoup
from transformers import pipeline

# Sample Data
data = {
    'RTSK Number': ['TASK0118575574', 'TASK0118575574', 'TASK0118575574', 'TASK0118575585', 'TASK0118575585', 'TASK0118575585'],
    'RTSK Descr': ['Capacity Reservation', 'Capacity Reservation', 'Capacity Reservation', 'Add Monitoring - Unix', 'Add Monitoring - Unix', 'Add Monitoring - Unix'],
    'RTSK Worknote': [
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>',
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>'
    ]
}

df = pd.DataFrame(data)

# Function to parse HTML content
def parse_html(content):
    if '<' in content and '>' in content:
        soup = BeautifulSoup(content, 'html.parser')
        return soup.get_text(separator=" ", strip=True)
    return content

# Function to filter unwanted phrases
def filter_comments(text, unwanted_phrases):
    sentences = text.split('. ')
    filtered = [sentence for sentence in sentences if all(phrase not in sentence for phrase in unwanted_phrases)]
    return '. '.join(filtered)

# Function to prioritize important phrases
def prioritize_comments(text, keywords):
    sentences = text.split('. ')
    prioritized = [sentence for sentence in sentences if any(keyword in sentence for keyword in keywords)]
    remaining = [sentence for sentence in sentences if all(keyword not in sentence for keyword in keywords)]
    return '. '.join(prioritized + remaining)

# Initialize summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# List of unwanted phrases
unwanted_phrases = ["Integrator Record Response", "Function Name: UPDATE_SYSTEM"]
keywords = ["Automation Failed"]

# Group by RTSK Number
# A plain string key keeps `name` a string; a one-element list makes it a tuple in recent pandas
grouped = df.groupby('RTSK Number')

summaries = []
for name, group in grouped:
    combined_notes = ' '.join(group['RTSK Worknote'].apply(parse_html).tolist())
    filtered_notes = filter_comments(combined_notes, unwanted_phrases)
    prioritized_notes = prioritize_comments(filtered_notes, keywords)
    summary = summarizer(prioritized_notes, max_length=130, min_length=30, do_sample=False)[0]['summary_text']
    summaries.append({
        'RTSK Number': name,
        'RTSK Descr': group['RTSK Descr'].iloc[0],
        'Summary': summary,
        'All Comments': '\n'.join(group['RTSK Worknote'])
    })

summary_df = pd.DataFrame(summaries)
print(summary_df)

# Save the summary to a CSV file
summary_df.to_csv('summarized_worknotes.csv', index=False)
print("Summarization complete. Check the summarized_worknotes.csv file.")

Explanation:

Parse HTML Content: The parse_html function uses BeautifulSoup to extract text from HTML content.
Filter and Prioritize: The filter_comments function drops every sentence that contains an unwanted phrase, and the prioritize_comments function moves sentences that contain a keyword to the front, ahead of the remaining sentences.
Summarize: Using the transformers library's BART model, the script summarizes the grouped comments.
Output: The script structures the summary and all comments, ensuring all comments for each RTSK number are grouped together on separate lines.
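
To make the filter and prioritize steps concrete, here is a small standalone sketch that runs the same two functions on one worknote. The sample text is invented for illustration; the phrase lists are the ones used in the script above.

```python
def filter_comments(text, unwanted_phrases):
    """Drop every sentence that contains any unwanted phrase."""
    sentences = text.split('. ')
    kept = [s for s in sentences if all(p not in s for p in unwanted_phrases)]
    return '. '.join(kept)


def prioritize_comments(text, keywords):
    """Move sentences containing a keyword ahead of the rest, keeping their relative order."""
    sentences = text.split('. ')
    hits = [s for s in sentences if any(k in s for k in keywords)]
    rest = [s for s in sentences if all(k not in s for k in keywords)]
    return '. '.join(hits + rest)


# Hypothetical worknote text, made up for this example
sample = ("Ticket opened by user. Function Name: UPDATE_SYSTEM called. "
          "Automation Failed on host check. Manual retry scheduled")

filtered = filter_comments(sample, ["Integrator Record Response", "Function Name: UPDATE_SYSTEM"])
result = prioritize_comments(filtered, ["Automation Failed"])
print(result)
# Automation Failed on host check. Ticket opened by user. Manual retry scheduled
```

Note that both functions split on '. ', so a whole sentence is dropped or moved, not just the matching phrase. Worknotes that use line breaks instead of full stops would need a different split.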

This solution integrates everything you need to handle various comment formats and produce structured summaries. If you have any more specific requirements or need further customization, let me know!

Transformers needs to connect to Hugging Face over the internet, which is restricted in our organization. Can you suggest another model that I can build without connecting to the internet?

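Got it! If you need a summarization model that can be used offline, you can consider SummerTime, an open-source text summarization toolkit. It supports various models such as TextRank, BART, and Longformer. Keep in mind that its neural models (BART, Longformer) still load pretrained weights, which are downloaded on first use unless they are already cached on the machine; the extractive models such as TextRank do not depend on Hugging Face. Here's how you can set it up: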

Step-by-Step Script

Install SummerTime:
sh
```
pip install summertime
```

Python Code:

python

import pandas as pd
from bs4 import BeautifulSoup
from summertime import model as st_model

# Sample Data
data = {
    'RTSK Number': ['TASK0118575574', 'TASK0118575574', 'TASK0118575574', 'TASK0118575585', 'TASK0118575585', 'TASK0118575585'],
    'RTSK Descr': ['Capacity Reservation', 'Capacity Reservation', 'Capacity Reservation', 'Add Monitoring - Unix', 'Add Monitoring - Unix', 'Add Monitoring - Unix'],
    'RTSK Worknote': [
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>',
        'Comment from automation',
        'Comment from manually written',
        '<h2>html content need parser</h2><p>This is HTML content.</p>'
    ]
}

df = pd.DataFrame(data)

# Function to parse HTML content
def parse_html(content):
    if '<' in content and '>' in content:
        soup = BeautifulSoup(content, 'html.parser')
        return soup.get_text(separator=" ", strip=True)
    return content

# Initialize summarization model.
# Note: BartModel loads pretrained weights, which are downloaded on first use
# unless they are already cached locally. On a machine with no internet access,
# an extractive model such as st_model.TextRankModel() avoids that download
# (it may still need its own local resources installed; check your version).
summarizer = st_model.BartModel()

# Group by RTSK Number (a plain string key keeps `name` a string, not a tuple)
grouped = df.groupby('RTSK Number')

summaries = []
for name, group in grouped:
    combined_notes = ' '.join(group['RTSK Worknote'].apply(parse_html).tolist())
    # summarize() takes a list of documents and returns one summary per document
    summary = summarizer.summarize([combined_notes])[0]
    summaries.append({
        'RTSK Number': name,
        'RTSK Descr': group['RTSK Descr'].iloc[0],
        'Summary': summary,
        'All Comments': '\n'.join(group['RTSK Worknote'])
    })

summary_df = pd.DataFrame(summaries)
print(summary_df)

# Save the summary to a CSV file
summary_df.to_csv('summarized_worknotes.csv', index=False)
print("Summarization complete. Check the summarized_worknotes.csv file.")

Explanation:

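SummerTime: This library provides various summarization models. Its extractive models (such as TextRank) run locally, while its neural models need pretrained weights that must already be on the machine to work offline.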
HTML Parsing: The parse_html function extracts text from HTML content.
Summarization: The BartModel from SummerTime is used to generate summaries.
Output: The script groups comments by RTSK number and provides structured summaries.
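
If installing beautifulsoup4 is also restricted, the HTML parsing step can be approximated with Python's standard library alone. The sketch below is an assumption-labelled alternative, not part of the script above: TextExtractor and parse_html_stdlib are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def parse_html_stdlib(content):
    """Return plain text for HTML-looking input; pass other input through unchanged."""
    if '<' in content and '>' in content:
        extractor = TextExtractor()
        extractor.feed(content)
        extractor.close()  # flush any text still buffered by the parser
        return ' '.join(extractor.parts)
    return content


print(parse_html_stdlib('<h2>html content need parser</h2><p>This is HTML content.</p>'))
# html content need parser This is HTML content.
```

This mirrors the behaviour of parse_html for simple worknotes. BeautifulSoup is more forgiving with malformed markup, so prefer it when it is available.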

This approach should work well within your organization's restrictions. If you need further customization or have any questions, feel free to ask!

Is SummerTime a better choice than Sumy?

Both SummerTime and Sumy are excellent choices for text summarization, but they have different strengths:

SummerTime:

Model Variety: Supports a wide range of models, including BART, Longformer, and TextRank.
Ease of Use: Designed for non-experts with easy-to-use APIs and visualizations.
Flexibility: Can handle various summarization tasks like query-based, multi-document, and dialogue summarization.
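Offline Use: Extractive models such as TextRank run locally; neural models such as BART need pretrained weights that are either downloaded on first use or copied to the machine beforehand.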

Sumy:

Simplicity: Easier to set up and use for basic summarization tasks.
Lightweight: Smaller library with fewer dependencies.
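Performance: Uses classical extractive algorithms (such as LSA, LexRank, TextRank, and Luhn), so it runs quickly on CPU with no model weights to download. Its tokenizer relies on NLTK data, which has to be installed once.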

Which to Choose?

If you need a more comprehensive toolkit with a variety of models and tasks, SummerTime might be the better choice.
If you prefer a simpler, lightweight solution for basic summarization, Sumy could be more suitable.
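
To show what a lightweight, fully offline extractive summarizer does, here is a minimal sketch that uses only the standard library. It is not Sumy's or SummerTime's API: it scores each sentence by the average frequency of its words and keeps the top ones in their original order. The stop-word list, the function names, and the sample notes are all assumptions made for this illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not taken from any library)
STOP_WORDS = {"the", "a", "an", "is", "was", "on", "to", "of", "and", "for", "in", "by"}


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def tokenize(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, num_sentences=2):
    """Keep the num_sentences highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return ' '.join(sentences[i] for i in chosen)


# Hypothetical combined worknotes for one RTSK number
notes = ("Automation failed on the disk check. The disk check was retried manually. "
         "Lunch was good. The manual retry of the disk check passed.")
summary = summarize(notes, num_sentences=2)
print(summary)
```

A function like this could stand in for the summarizer call in the grouping loop when no external model is allowed, at the cost of purely extractive, less fluent summaries.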

---------------------------------------------------------------------------------------------------------------------

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
