HTML Parser Summarization
Certainly! Below is a Python script that uses the BeautifulSoup library to strip the HTML markup from the comments and sumy to summarize the resulting text. It produces a short extractive summary of the content, in the spirit of the one I provided earlier.
Step-by-Step Script
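Install Necessary Libraries:
You can install the required libraries using pip.
    pip install beautifulsoup4
    pip install sumy
sumy's English tokenizer relies on NLTK's sentence-tokenizer data, which may need a one-time download if it is not already present (for example: python -c "import nltk; nltk.download('punkt')").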
Python Code:
    from bs4 import BeautifulSoup
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.lsa import LsaSummarizer

    # Sample comment, reproduced as captured (the markup is cut off near the end).
    html_content = '''
    <h2>Automation Results</h2><br />====== =================<h3>Parameters:</h3><br/>dbaas_realm: GBLPRD<br />db_version: 19c<br/>target_operational_env: PRD<br />==========================<br/><br/><h3>Status of DB Reservation job: </h3><br/><br/><p><table border="1"> <tbody> <tr> <th>Job Name</th><th>Status</th> <th>OEM Job URL (SSO)</th> </ tr><tr><td>IPSOFT_SS_DBReservation_RITM0112482729_QK20Hzlok8</td><td style="color:#008000">SUCCEEDED </td><td><a href="https://dbaas-oem-prd.swissbank.com:7301/em/faces/core-jobs- procedure ExecutionTracking?execution GUID=22F1DA36F786B2C2E0630685380AC80A&instance GUID=22F1DA36F783B2C2E0630685380 AC80A&showProcActLink=yes">Link</a></td></tr></tbody> </table> </p>Rsvname: PDECOM6Q.PRD.GBL.UBS.NET Tier: BRONZE+ Primary Host 1: xldn30846por.ubsglobal-prod.msad.ubs.net Standby Host 1: xldn30821por.ubsglobal-prod.msad.ubs.net AutomationSuccessCode=![{"reservationName":"PDECOM6Q.PRD.GBL.UBS.NET", "primary_1":"xldn30846por.ubsglobal- prod.msad.ubs.net", "standby_a1":"xldn30821por.ubsglobal- prod.msad.ubs.net", "standby_a2":"n/a","primary_2":"n/a","standby_b1":"n/a","standby_b2":"n/a"}]! SUCCESS.
    Task currently being worked on by automation<br /><p style="margin-left: 40px"><table border="1"> <tbody> <tr> <th>Automation prod.ldn.swissbank.com/IPradar/update.htm?ticketID=172254162" target="_blank">172254162</a><br/></td> </ tr> <tr> <th>Execution</th> <td><a href="https://ipcenter-prod.ldn.swissbank.com/IPautomata/executionDetails.htm?executionID=136408170" D A A
    '''

    def clean_html(html):
        # Drop the tags and join the remaining text nodes with single spaces.
        soup = BeautifulSoup(html, 'html.parser')
        return soup.get_text(separator=" ", strip=True)

    def summarize_text(text, num_sentences=3):
        # LSA is extractive: it selects existing sentences rather than writing new ones.
        parser = PlaintextParser.from_string(text, Tokenizer("english"))
        summarizer = LsaSummarizer()
        summary = summarizer(parser.document, num_sentences)
        return ' '.join(str(sentence) for sentence in summary)

    raw_text = clean_html(html_content)
    summary = summarize_text(raw_text)
    print("Summary:")
    print(summary)
Explanation:
BeautifulSoup: This library is used to parse and clean the HTML content, extracting the text while removing HTML tags.
Sumy: This library performs the summarization. The LsaSummarizer is used for summarizing the cleaned text.
Integration: The script cleans the HTML content, summarizes the extracted text, and prints the summary.
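To make the cleaning step concrete, here is a minimal sketch that runs the same get_text call on a small fragment modeled on the sample comment. It assumes beautifulsoup4 is installed; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Small fragment modeled on the sample comment; the names here are illustrative.
fragment = "<h3>Parameters:</h3><br/>dbaas_realm: GBLPRD<br />db_version: 19c<br/>"

soup = BeautifulSoup(fragment, "html.parser")

# separator=" " joins the text nodes with a space; strip=True trims each node
# and drops nodes that contain only whitespace.
text = soup.get_text(separator=" ", strip=True)
print(text)  # Parameters: dbaas_realm: GBLPRD db_version: 19c
```

Without the separator argument the text nodes would be joined with nothing between them, so the heading and the first parameter would run together.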
Output:
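When you run the script, it prints an extractive summary: up to three sentences that LsaSummarizer selects from the cleaned text. Because this sample contains little sentence punctuation, the result will likely be less structured than the detailed summary I initially provided.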
---------------------------------------------------------------------------------------------------------------------------
I understand you want to create a more structured and detailed summary, similar to the one I initially provided. To achieve this, we can manually parse the relevant sections and format them accordingly. Here’s how you can enhance your script to produce a summary with specific sections and formatted details:
Enhanced Script for Structured Summary
Install Necessary Libraries:
    pip install beautifulsoup4
Python Code:
    from bs4 import BeautifulSoup

    # Sample comment, reproduced as captured (the markup is cut off near the end).
    html_content = '''
    <h2>Automation Results</h2><br />====== =================<h3>Parameters:</h3><br/>dbaas_realm: GBLPRD<br />db_version: 19c<br/>target_operational_env: PRD<br />==========================<br/><br/><h3>Status of DB Reservation job: </h3><br/><br/><p><table border="1"> <tbody> <tr> <th>Job Name</th><th>Status</th> <th>OEM Job URL (SSO)</th> </ tr><tr><td>IPSOFT_SS_DBReservation_RITM0112482729_QK20Hzlok8</td><td style="color:#008000">SUCCEEDED </td><td><a href="https://dbaas-oem-prd.swissbank.com:7301/em/faces/core-jobs- procedure ExecutionTracking?execution GUID=22F1DA36F786B2C2E0630685380AC80A&instance GUID=22F1DA36F783B2C2E0630685380 AC80A&showProcActLink=yes">Link</a></td></tr></tbody> </table> </p>Rsvname: PDECOM6Q.PRD.GBL.UBS.NET Tier: BRONZE+ Primary Host 1: xldn30846por.ubsglobal-prod.msad.ubs.net Standby Host 1: xldn30821por.ubsglobal-prod.msad.ubs.net AutomationSuccessCode=![{"reservationName":"PDECOM6Q.PRD.GBL.UBS.NET", "primary_1":"xldn30846por.ubsglobal- prod.msad.ubs.net", "standby_a1":"xldn30821por.ubsglobal- prod.msad.ubs.net", "standby_a2":"n/a","primary_2":"n/a","standby_b1":"n/a","standby_b2":"n/a"}]! SUCCESS.
    Task currently being worked on by automation<br /><p style="margin-left: 40px"><table border="1"> <tbody> <tr> <th>Automation prod.ldn.swissbank.com/IPradar/update.htm?ticketID=172254162" target="_blank">172254162</a><br/></td> </ tr> <tr> <th>Execution</th> <td><a href="https://ipcenter-prod.ldn.swissbank.com/IPautomata/executionDetails.htm?executionID=136408170" D A A
    '''

    def clean_html(html):
        soup = BeautifulSoup(html, 'html.parser')
        return soup.get_text(separator=" ", strip=True)

    def find_token(tokens, prefix, start=0):
        # Index of the first token at or after `start` that begins with `prefix`.
        # Prefix matching is needed because the success-code marker is fused to
        # its payload ("AutomationSuccessCode=![{..."), so an exact match fails.
        # Returns len(tokens) when nothing matches, so slicing stays in range.
        for i in range(start, len(tokens)):
            if tokens[i].startswith(prefix):
                return i
        return len(tokens)

    def format_summary(cleaned_text):
        # split() with no argument also splits on the newline inside the sample.
        tokens = cleaned_text.split()

        # (heading, marker prefix, number of marker tokens to drop from the body)
        sections = [
            ("Parameters:", "Parameters:", 1),
            ("Status of DB Reservation job:", "Status", 5),
            ("Reservation Details:", "Rsvname:", 0),
            ("Automation Success Code:", "AutomationSuccessCode", 0),
            ("Current Task:", "Task", 0),
        ]

        # Locate each marker in order, searching from the previous marker onward.
        starts = []
        position = 0
        for _, prefix, _ in sections:
            position = find_token(tokens, prefix, position)
            starts.append(position)
        starts.append(len(tokens))

        # Each section runs from its own marker up to the next one.
        parts = []
        for i, (heading, _, skip) in enumerate(sections):
            body = " ".join(tokens[starts[i] + skip:starts[i + 1]])
            parts.append(heading + "\n" + body)
        return "\n\n".join(parts)

    cleaned_text = clean_html(html_content)
    structured_summary = format_summary(cleaned_text)
    print("Structured Summary:\n")
    print(structured_summary)
Explanation:
BeautifulSoup: Parses and cleans the HTML content.
Format Summary Function: Manually processes and structures the text into specific sections such as Parameters, Status of DB Reservation job, Reservation Details, Automation Success Code, and Current Task.
Structured Output: The script formats and prints a more human-readable summary similar to the one I provided.
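The AutomationSuccessCode marker in the sample appears to wrap a JSON array between "![" and "]!". Splitting on whitespace breaks that payload into pieces, so one alternative is to pull it out with a regular expression and decode it with the json module. The sketch below assumes that marker format, which is inferred from the sample rather than from any specification; extract_success_code is a hypothetical helper name, and the sample string is a shortened version of the original comment.

```python
import json
import re

# Assumption: the payload is a JSON array wrapped as AutomationSuccessCode=![ ... ]!
SUCCESS_CODE_RE = re.compile(r"AutomationSuccessCode=!(\[.*?\])!", re.DOTALL)

def extract_success_code(text):
    """Return the first object in the embedded JSON array, or None if absent."""
    match = SUCCESS_CODE_RE.search(text)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    # Only accept the shape seen in the sample: a non-empty list.
    if isinstance(payload, list) and payload:
        return payload[0]
    return None

# Shortened sample built from values that appear in the original comment.
sample = (
    'Tier: BRONZE+ AutomationSuccessCode=![{"reservationName":"PDECOM6Q.PRD.GBL.UBS.NET", '
    '"standby_a2":"n/a","primary_2":"n/a"}]! SUCCESS.'
)

details = extract_success_code(sample)
if details is not None:
    for key, value in details.items():
        print(f"{key}: {value}")
```

Decoding the payload this way yields one key per line, which may be easier to format than the raw token run in the Automation Success Code section.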