Dynamic code summary2

To capture the issue keywords along with the date line following #WD, the regular expression has to span multiple lines and capture both the date line and the keywords line after it. Here's the updated code:

python
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import re

# Assume 'dataset' holds the input data for this script
df = dataset

# Convert RTSK Worknote to string
df["RTSK Worknote"] = df["RTSK Worknote"].astype(str)

# Function to parse HTML content
def parse_html(content):
    if '<' in content and '>' in content:
        soup = BeautifulSoup(content, 'html.parser')
        # Join text nodes with newlines so the line breaks the regex relies on survive
        return soup.get_text(separator="\n", strip=True)
    return content

# Function to extract error codes and issue keywords
def extract_issues(text):
    # Match #WD followed by two or three digits (optional whitespace in between),
    # then the date line and the issue keywords line that follows it
    match = re.search(r'(#WD\s?\d{2,3})\s*([^\n]*)(\n.*)', text)
    if match:
        error_code = match.group(1).replace(" ", "")  # Extract the error code and remove spaces
        date_line = match.group(2)  # Extract the date line
        issue_keyword = match.group(3).strip()  # Extract the issue keywords line and strip leading/trailing spaces
        combined_issue_keyword = date_line + " " + issue_keyword  # Combine date line and issue keywords
        return pd.Series([error_code, combined_issue_keyword])
    return pd.Series(["No error code", "No issue keyword"])

# Apply HTML parsing to RTSK Worknote
df['RTSK Worknote'] = df['RTSK Worknote'].apply(parse_html)

# Apply the extract_issues function to RTSK Worknote
df[['Error Code', 'Issue Keywords']] = df['RTSK Worknote'].apply(extract_issues)

# Display the updated DataFrame
print(df.head())

# The final dataset
dataset = df

Explanation:

  • re.search(r'(#WD\s?\d{2,3})\s*([^\n]*)(\n.*)', text): This regular expression captures:

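    • #WD\s?\d{2,3}: The error code: #WD, an optional whitespace character, then two or three digits.

    • \s*([^\n]*): The date line. \s* skips any spaces or newlines after the code, and [^\n]* captures the rest of that line.

    • (\n.*): The issue keywords line: a newline followed by the next line only, since . does not match newlines by default.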

  • match.group(1): Extracts the error code.

  • match.group(2): Extracts the date line.

  • match.group(3).strip(): Extracts the issue keywords line and removes leading/trailing spaces.

  • combined_issue_keyword: Joins the date line and the issue keywords with a space, producing a single string that contains both.
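
To see how the three groups divide a worknote, here is a minimal sketch that runs the same pattern on a single string. The sample text is made up for illustration and is not taken from the real dataset.

```python
import re

# Same pattern as in extract_issues
PATTERN = re.compile(r'(#WD\s?\d{2,3})\s*([^\n]*)(\n.*)')

# Hypothetical worknote text, invented for this example
sample = "Ticket update\n#WD 12\n2024-01-15\nprinter offline\nfollow-up pending"

match = PATTERN.search(sample)
error_code = match.group(1).replace(" ", "")  # code with the inner space removed
date_line = match.group(2)                    # line right after the code
issue_keyword = match.group(3).strip()        # one line after the date line

print(error_code)     # #WD12
print(date_line)      # 2024-01-15
print(issue_keyword)  # printer offline
```

The last line of the sample ("follow-up pending") is not captured, because group 3 stops at the end of the keywords line.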

This updated code should correctly capture both the date line and the issue keywords line following #WD.
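
Two edge cases are worth checking before relying on this. The sketch below uses made-up inputs and a hypothetical helper, split_note, to show what the pattern does when the date sits on the same line as the code, and when the note has no line break after the code at all.

```python
import re

# Same pattern as in extract_issues
PATTERN = re.compile(r'(#WD\s?\d{2,3})\s*([^\n]*)(\n.*)')

def split_note(text):
    """Return (code, date_line, keywords), or None when the pattern does not match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return (m.group(1).replace(" ", ""), m.group(2), m.group(3).strip())

# Date on the same line as the code, keywords on the next line
same_line = split_note("#WD123 2024-01-15\ndisk full")

# Everything on one line: the pattern needs a newline, so nothing matches
no_newline = split_note("#WD12 2024-01-15 disk full")

print(same_line)   # ('#WD123', '2024-01-15', 'disk full')
print(no_newline)  # None
```

In the second case extract_issues would fall through to "No error code", even though a code is present, so single-line worknotes may need separate handling.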
