Dynamic code summary2
To capture the issue keywords along with the date line following #WD, we need to adjust the regular expression to handle multiple lines and capture both the date line and the subsequent keywords. Here’s the updated code:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import re

# Assume 'dataset' holds the input data for this script
df = dataset

# Convert RTSK Worknote to string
df["RTSK Worknote"] = df["RTSK Worknote"].astype(str)

# Function to parse HTML content
def parse_html(content):
    if '<' in content and '>' in content:
        soup = BeautifulSoup(content, 'html.parser')
        # Join text nodes with newlines: the regex in extract_issues relies on
        # line breaks, and an empty separator would fuse the lines together
        return soup.get_text(separator="\n", strip=True)
    return content

# Function to extract error codes and issue keywords
def extract_issues(text):
    # Regular expression to find patterns like #WD__ where __ are 2-3 digits,
    # followed by optional space/newline, the date line, and the issue keywords line
    match = re.search(r'(#WD\s?\d{2,3})\s*([^\n]*)(\n.*)', text)
    if match:
        error_code = match.group(1).replace(" ", "")  # Extract the error code and remove spaces
        date_line = match.group(2)  # Extract the date line
        issue_keyword = match.group(3).strip()  # Extract the issue keywords line and strip leading/trailing spaces
        combined_issue_keyword = date_line + " " + issue_keyword  # Combine date line and issue keywords
        return pd.Series([error_code, combined_issue_keyword])
    return pd.Series(["No error code", "No issue keyword"])

# Apply HTML parsing to RTSK Worknote
df['RTSK Worknote'] = df['RTSK Worknote'].apply(parse_html)

# Apply the extract_issues function to RTSK Worknote
df[['Error Code', 'Issue Keywords']] = df['RTSK Worknote'].apply(extract_issues)

# Display the updated DataFrame
print(df.head())

# The final dataset
dataset = df
Explanation:
re.search(r'(#WD\s?\d{2,3})\s*([^\n]*)(\n.*)', text): This regular expression captures:
    (#WD\s?\d{2,3}): The error code pattern.
    \s*([^\n]*): The date line, allowing for optional spaces or newline characters.
    (\n.*): The issue keywords line.
match.group(1): Extracts the error code.
match.group(2): Extracts the date line.
match.group(3).strip(): Extracts the issue keywords line and removes leading/trailing spaces.
Combines the date line and issue keywords: Creates a single string that includes both the date line and the issue keywords.
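To make the three groups concrete, here is a small standalone sketch that runs the same pattern on one worknote string. The sample text is invented for illustration and is not taken from the real "RTSK Worknote" column.

```python
import re

# Same pattern as in extract_issues.
PATTERN = re.compile(r'(#WD\s?\d{2,3})\s*([^\n]*)(\n.*)')

# Hypothetical worknote text, made up for this example.
sample = (
    "Checked device.\n"
    "#WD 12\n"
    "2024-03-05 10:15\n"
    "printer offline, driver reinstall\n"
    "Closed."
)

match = PATTERN.search(sample)

error_code = match.group(1).replace(" ", "")   # code with the inner space removed
date_line = match.group(2)                     # first non-empty line after the code
issue_keyword = match.group(3).strip()         # the single line after the date line
combined = date_line + " " + issue_keyword

print(repr(error_code))
print(repr(date_line))
print(repr(issue_keyword))
print(repr(combined))
```

Because `.` does not match a newline by default, group 3 stops at the end of the keywords line, so the trailing "Closed." line is not captured.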
This updated code should correctly capture both the date line and the issue keywords line following #WD.
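One case the pattern does not cover: because `(\n.*)` is required, a worknote where nothing follows the `#WD` line (or the date line) produces no match at all, and the error code is reported as "No error code". If that matters for your data, a possible variant makes the keywords line optional. This is only a sketch: `extract_issues_lenient` and `PATTERN_OPTIONAL` are hypothetical names, the sample strings are invented, and it returns a plain tuple so it runs without pandas.

```python
import re

ORIGINAL = re.compile(r'(#WD\s?\d{2,3})\s*([^\n]*)(\n.*)')

# Variant: the trailing "?" makes the keywords line optional.
PATTERN_OPTIONAL = re.compile(r'(#WD\s?\d{2,3})\s*([^\n]*)(\n.*)?')

def extract_issues_lenient(text):
    """Return (error_code, issue_keywords), tolerating a missing keywords line."""
    match = PATTERN_OPTIONAL.search(text)
    if not match:
        return ("No error code", "No issue keyword")
    error_code = match.group(1).replace(" ", "")
    date_line = match.group(2).strip()
    # group(3) is None when no further line follows the date line
    issue_keyword = (match.group(3) or "").strip()
    combined = (date_line + " " + issue_keyword).strip()
    return (error_code, combined or "No issue keyword")

# Invented inputs for illustration.
full_note = "note\n#WD 07\n2024-03-05\nscanner jam"
date_only = "#WD45 2024-03-05"
code_only = "#WD45"

print(ORIGINAL.search(date_only))            # the original pattern finds no match here
print(extract_issues_lenient(full_note))
print(extract_issues_lenient(date_only))
print(extract_issues_lenient(code_only))
```

To use it with the DataFrame code, wrap the result in `pd.Series(...)` so that `apply` still expands into the two output columns.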