HTML Parser Summarization

The Python script below uses the BeautifulSoup library to strip the HTML markup from the ticket comments and the sumy library to summarize the resulting text.

Step-by-Step Script

Install Necessary Libraries:
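- Install the required libraries with pip. sumy's English tokenizer relies on NLTK tokenizer data, which may need to be downloaded separately on first use.
  sh
  pip install beautifulsoup4
  pip install sumy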

Python Code:

python

from bs4 import BeautifulSoup
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

html_content = '''
<h2>Automation Results</h2><br />====== =================<h3>Parameters:</h3><br/>dbaas_realm: GBLPRD<br
/>db_version: 19c<br/>target_operational_env: PRD<br />==========================<br/><br/><h3>Status of DB Reservation job:
</h3><br/><br/><p><table border="1"> <tbody> <tr> <th>Job Name</th><th>Status</th> <th>OEM Job URL (SSO)</th> </
tr><tr><td>IPSOFT_SS_DBReservation_RITM0112482729_QK20Hzlok8</td><td style="color:#008000">SUCCEEDED
</td><td><a href="https://dbaas-oem-prd.swissbank.com:7301/em/faces/core-jobs-
procedure ExecutionTracking?execution GUID=22F1DA36F786B2C2E0630685380AC80A&instance GUID=22F1DA36F783B2C2E0630685380
AC80A&showProcActLink=yes">Link</a></td></tr></tbody> </table> </p>Rsvname: PDECOM6Q.PRD.GBL.UBS.NET
Tier: BRONZE+
Primary Host 1: xldn30846por.ubsglobal-prod.msad.ubs.net
Standby Host 1: xldn30821por.ubsglobal-prod.msad.ubs.net
AutomationSuccessCode=![{"reservationName":"PDECOM6Q.PRD.GBL.UBS.NET", "primary_1":"xldn30846por.ubsglobal-
prod.msad.ubs.net", "standby_a1":"xldn30821por.ubsglobal-
prod.msad.ubs.net", "standby_a2":"n/a","primary_2":"n/a","standby_b1":"n/a","standby_b2":"n/a"}]!
SUCCESS.
Task currently being worked on by automation<br /><p style="margin-left: 40px"><table border="1"> <tbody> <tr> <th>Automation
prod.ldn.swissbank.com/IPradar/update.htm?ticketID=172254162" target="_blank">172254162</a><br/></td> </ tr> <tr>
<th>Execution</th> <td><a href="https://ipcenter-prod.ldn.swissbank.com/IPautomata/executionDetails.htm?executionID=136408170"
D
A
A
'''

def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text(separator=" ", strip=True)

def summarize_text(text, num_sentences=3):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    summary = summarizer(parser.document, num_sentences)
    return ' '.join(str(sentence) for sentence in summary)

raw_text = clean_html(html_content)
summary = summarize_text(raw_text)

print("Summary:")
print(summary)

Explanation:

BeautifulSoup: Parses the HTML and extracts the text. get_text(separator=" ", strip=True) drops the tags and joins the remaining text fragments with single spaces.
Sumy: Performs the summarization. LsaSummarizer ranks sentences using latent semantic analysis and returns the top num_sentences (three by default in this script).
Integration: The script cleans the HTML content, summarizes the extracted text, and prints the summary.
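
To make the cleaning step concrete, here is a minimal sketch of the same idea using only Python's standard library. It is a rough stand-in for clean_html, not a description of how BeautifulSoup works internally, and the names TextExtractor and clean_html_stdlib are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip fragments that are only whitespace.
        text = data.strip()
        if text:
            self.chunks.append(text)


def clean_html_stdlib(html):
    """Return the visible text of `html`, fragments joined by single spaces."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered at the end of input
    return " ".join(extractor.chunks)


snippet = "<h3>Parameters:</h3><br/>dbaas_realm: GBLPRD<br/>db_version: 19c"
print(clean_html_stdlib(snippet))
# Parameters: dbaas_realm: GBLPRD db_version: 19c
```

BeautifulSoup is generally more forgiving of malformed markup, so this sketch is only meant to show what "extracting the text" amounts to.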

Output:

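When you run the script, it prints a short extractive summary of the comment text. Because the extracted text consists mostly of key-value pairs rather than full sentences, the summary may be coarse; the structured approach in the next section addresses this.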

---------------------------------------------------------------------------------------------------------------------------

To produce a more structured and detailed summary, we can manually parse the relevant sections of the text and format them accordingly. The enhanced script below produces a summary with specific sections and formatted details:
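
As a small illustration of what manual parsing can look like, the sketch below turns "key: value" pairs such as the ones in the Parameters section into a dictionary. It assumes that each key and each value is a single token without spaces, which holds for the sample but may not hold for other tickets; parse_parameters is a hypothetical helper name.

```python
import re


def parse_parameters(text):
    """Collect 'key: value' pairs where key and value are single tokens."""
    # The value must contain no colon and must be followed by whitespace or
    # the end of the text, so a heading such as "Parameters:" is not mistaken
    # for a key whose value is the next key.
    pairs = re.findall(r"\b(\w+):\s+([^\s:]+)(?!\S)", text)
    return dict(pairs)


sample = "Parameters: dbaas_realm: GBLPRD db_version: 19c target_operational_env: PRD"
print(parse_parameters(sample))
# {'dbaas_realm': 'GBLPRD', 'db_version': '19c', 'target_operational_env': 'PRD'}
```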

Enhanced Script for Structured Summary

Install Necessary Libraries:
sh
```
pip install beautifulsoup4
```

Python Code:

python

from bs4 import BeautifulSoup

html_content = '''
<h2>Automation Results</h2><br />====== =================<h3>Parameters:</h3><br/>dbaas_realm: GBLPRD<br
/>db_version: 19c<br/>target_operational_env: PRD<br />==========================<br/><br/><h3>Status of DB Reservation job:
</h3><br/><br/><p><table border="1"> <tbody> <tr> <th>Job Name</th><th>Status</th> <th>OEM Job URL (SSO)</th> </
tr><tr><td>IPSOFT_SS_DBReservation_RITM0112482729_QK20Hzlok8</td><td style="color:#008000">SUCCEEDED
</td><td><a href="https://dbaas-oem-prd.swissbank.com:7301/em/faces/core-jobs-
procedure ExecutionTracking?execution GUID=22F1DA36F786B2C2E0630685380AC80A&instance GUID=22F1DA36F783B2C2E0630685380
AC80A&showProcActLink=yes">Link</a></td></tr></tbody> </table> </p>Rsvname: PDECOM6Q.PRD.GBL.UBS.NET
Tier: BRONZE+
Primary Host 1: xldn30846por.ubsglobal-prod.msad.ubs.net
Standby Host 1: xldn30821por.ubsglobal-prod.msad.ubs.net
AutomationSuccessCode=![{"reservationName":"PDECOM6Q.PRD.GBL.UBS.NET", "primary_1":"xldn30846por.ubsglobal-
prod.msad.ubs.net", "standby_a1":"xldn30821por.ubsglobal-
prod.msad.ubs.net", "standby_a2":"n/a","primary_2":"n/a","standby_b1":"n/a","standby_b2":"n/a"}]!
SUCCESS.
Task currently being worked on by automation<br /><p style="margin-left: 40px"><table border="1"> <tbody> <tr> <th>Automation
prod.ldn.swissbank.com/IPradar/update.htm?ticketID=172254162" target="_blank">172254162</a><br/></td> </ tr> <tr>
<th>Execution</th> <td><a href="https://ipcenter-prod.ldn.swissbank.com/IPautomata/executionDetails.htm?executionID=136408170"
D
A
A
'''

def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text(separator=" ", strip=True)

def format_summary(cleaned_text):
    # Split on any whitespace (the extracted text contains newlines as well as
    # spaces) and drop separator tokens made up only of "=" characters.
    tokens = [t for t in cleaned_text.split() if t.strip("=")]

    # (output heading, token prefix that opens the section, heading tokens to skip).
    # Prefix matching is required because "AutomationSuccessCode" is fused to its
    # value ("AutomationSuccessCode=![{..."), so an exact comparison never matches.
    markers = [
        ("Parameters:", "Parameters:", 1),
        ("Status of DB Reservation job:", "Status", 5),
        ("Reservation Details:", "Rsvname:", 0),
        ("Automation Success Code:", "AutomationSuccessCode", 0),
        ("Current Task:", "Task", 0),
    ]

    # Locate each marker in order, searching forward from the previous one.
    found = []
    position = 0
    for heading, prefix, skip in markers:
        index = next((i for i in range(position, len(tokens))
                      if tokens[i].startswith(prefix)), None)
        if index is None:
            continue  # Section missing from this input: leave it out.
        found.append((heading, index, skip))
        position = index + 1

    # Each section runs from its marker to the next marker (or to the end).
    parts = []
    for n, (heading, index, skip) in enumerate(found):
        end = found[n + 1][1] if n + 1 < len(found) else len(tokens)
        body = " ".join(tokens[index + skip:end])
        parts.append(heading + "\n" + body)

    return "\n\n".join(parts)

cleaned_text = clean_html(html_content)
structured_summary = format_summary(cleaned_text)

print("Structured Summary:\n")
print(structured_summary)

Explanation:

BeautifulSoup: Parses the HTML content and extracts the plain text.
Format Summary Function: Splits the text into tokens, locates the marker that opens each section (Parameters, Status of DB Reservation job, Reservation Details, Automation Success Code, and Current Task), and collects the tokens up to the next marker under that section's heading.
Structured Output: The script prints the sections in a more human-readable layout than the automatic summary.
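
The Automation Success Code section appears to carry a JSON object wrapped in `![` and `]!`. If that holds, the object can be decoded directly instead of being kept as a flat string. The sketch below is based only on the sample above: the delimiter format is an assumption inferred from it, and parse_success_code is a hypothetical helper name.

```python
import json
import re


def parse_success_code(text):
    """Return the JSON object found in AutomationSuccessCode=![{...}]!, or None."""
    # Assumption: the payload is a single JSON object between "![" and "]!".
    match = re.search(r"AutomationSuccessCode=!\[(\{.*?\})\]!", text, re.DOTALL)
    if match is None:
        return None
    # strict=False tolerates raw line breaks inside string values, which can
    # occur when the comment text was wrapped mid-value.
    return json.loads(match.group(1), strict=False)


sample = (
    'Tier: BRONZE+ AutomationSuccessCode=![{"reservationName":'
    '"PDECOM6Q.PRD.GBL.UBS.NET", "standby_a2":"n/a"}]! SUCCESS.'
)
details = parse_success_code(sample)
print(details["reservationName"])
# PDECOM6Q.PRD.GBL.UBS.NET
```

Decoding the payload gives named fields that can be printed one per line, which is usually easier to read than the raw string.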
