Dynamic code summary
Here’s the updated code:
```python
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import re

# Assume 'dataset' holds the input data for this script
df = dataset

# Convert RTSK Worknote to string
df["RTSK Worknote"] = df["RTSK Worknote"].astype(str)

# Function to parse HTML content
def parse_html(content):
    if '<' in content and '>' in content:
        soup = BeautifulSoup(content, 'html.parser')
        return soup.get_text(separator="", strip=True)
    return content

# Function to extract error codes and issue keywords
def extract_issues(text):
    # Regular expression to find patterns like #WD__ where __ are digits
    match = re.search(r'(#WD\d{2,3})\s*(.*)', text)
    if match:
        error_code = match.group(1)  # Extract the error code
        issue_keyword = match.group(2)  # Extract the issue keyword
        return pd.Series([error_code, issue_keyword])
    return pd.Series(["No error code", "No issue keyword"])

# Apply HTML parsing to RTSK Worknote
df['RTSK Worknote'] = df['RTSK Worknote'].apply(parse_html)

# Apply the extract_issues function to RTSK Worknote
df[['Error Code', 'Issue Keywords']] = df['RTSK Worknote'].apply(extract_issues)

# Display the updated DataFrame
print(df.head())

# The final dataset
dataset = df
```
Explanation:
Regular Expression (Regex):
`r'(#WD\d{2,3})\s*(.*)'` identifies the pattern `#WD__`, where `__` stands for 2 or 3 digits, followed by any issue keywords.
`#WD\d{2,3}` matches `#WD` followed by 2 or 3 digits.
`\s*` matches any whitespace after the error code.
`(.*)` captures the issue keywords that follow, up to the end of the line.
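To see how the three parts of the pattern behave, here is a minimal sketch that runs the same regex on a few strings. The sample worknotes and the helper name `split_note` are made up for illustration and are not part of the original script.

```python
import re

# Same pattern as in the script above.
PATTERN = re.compile(r'(#WD\d{2,3})\s*(.*)')

# Hypothetical worknote strings, invented for this example.
samples = [
    "#WD12 Printer offline",                # 2-digit code
    "Ticket reopened: #WD345 VPN timeout",  # 3-digit code with text before it
    "#WD7 too short",                       # only 1 digit, so no match
    "#WD1234 extra digit",                  # 4 digits: only the first 3 join the code
    "No code in this note",                 # no code at all
]

def split_note(text):
    """Return (error_code, issue_keywords), or None when no code is found."""
    match = PATTERN.search(text)
    return match.groups() if match else None

for sample in samples:
    print(repr(sample), "->", split_note(sample))
```

Note the fourth sample: because `\d{2,3}` stops after three digits, the fourth digit ends up at the start of the issue keywords. If codes longer than three digits can occur in the data, the quantifier would need adjusting.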
extract_issues Function:
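Uses the regex to extract the first error code in the text and the issue keywords that follow it.
Returns these values as a Series to be added as new columns in the DataFrame, or the placeholders "No error code" and "No issue keyword" when nothing matches.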
Applying the Function:
The `apply` function is used to apply `extract_issues` to each `RTSK Worknote`.
The results are stored in new columns `Error Code` and `Issue Keywords`.
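The apply step can be tried in isolation on a tiny DataFrame. This is a sketch under assumptions: the three worknotes below are invented sample data, and `demo` is a hypothetical stand-in for the real `dataset`.

```python
import re
import pandas as pd

def extract_issues(text):
    # Same logic as in the script: first #WD code plus the text after it.
    match = re.search(r'(#WD\d{2,3})\s*(.*)', text)
    if match:
        return pd.Series([match.group(1), match.group(2)])
    return pd.Series(["No error code", "No issue keyword"])

# Hypothetical sample data, not taken from the real dataset.
demo = pd.DataFrame({
    "RTSK Worknote": [
        "#WD12 Printer offline",
        "Called user, no answer",
        "Escalated #WD345   VPN timeout",
    ]
})

# Each call returns a two-element Series, so apply() yields a two-column
# DataFrame that is assigned to the two new columns by position.
demo[["Error Code", "Issue Keywords"]] = demo["RTSK Worknote"].apply(extract_issues)

print(demo)
```

The second row has no `#WD` code, so it receives the placeholder values, while the extra spaces in the third row are absorbed by `\s*`.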