Can we crawl the K-Business Portal?

힘센캥거루
October 13, 2025 (edited)

Recently, I noticed that the K-Business Portal displays official document titles only in three-month segments.

What if we could collect all document titles through crawling and compile three years' worth into an Excel file?

It seemed like using a filter to search document numbers would be easy.

To cut to the chase, it was impossible.


1. Everyone has a plan.

My grand plan was this:

1. Access and log in to the business portal with Python's Selenium.
2. Use XPath to collect the tr and td elements inside the table tags and convert them into a DataFrame.
3. Use datetime and timedelta to step through the dates and collect every document in three-month windows.
4. Use the result in Excel or Google Sheets as needed.
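As a sketch, step 3 of this plan, stepping through three years in three-month windows, would have looked something like the following. The portal interaction itself is left as comments, since (as it turned out) there was no table to scrape:

```python
from datetime import date, timedelta

def three_month_windows(start: date, end: date, days: int = 90):
    """Yield (window_start, window_end) date pairs covering [start, end]."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        yield cur, nxt
        cur = nxt

# Planned usage (placeholder steps -- the real portal exposes no <table>):
#   for win_start, win_end in three_month_windows(date(2022, 10, 1), date(2025, 10, 1)):
#       set the portal's date filter to [win_start, win_end] and search,
#       collect the //table//tr/td texts into a pandas DataFrame, append it,
#       and finally pd.concat(...).to_excel("documents.xlsx").

print(list(three_month_windows(date(2025, 1, 1), date(2025, 7, 1))))
```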

However, when I actually accessed the business portal, I couldn't find any tables or even iframes.

After fumbling around for four hours, here's why crawling turned out to be impossible.

2. What about your security?

1. WebDRM


The first barrier was WebDRM.

This prevented me from enabling developer mode.

However, I tried exploiting the fact that if developer mode was already on before visiting the site, it stayed on after navigation.

I eventually succeeded in enabling developer mode.

2. It appears in the browser, but the elements are absent?

This part was the hardest to understand.

Although elements were clearly visible in developer mode, they didn't appear when using page_source with the ChromeDriver.

This was the same whether using selenium or puppeteer.

What's even more intriguing is that they didn't appear even when queried with JavaScript in the console.

I could see them but not grab them, and that tantalizing feeling drove me even crazier.
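Symptoms like this usually point to content rendered inside a shadow DOM, a canvas, or an external process rather than the regular DOM (an assumption on my part; I never pinned down which applied here). A probe like the following, run through Selenium's execute_script, would distinguish the first two cases. Note that a closed shadow root reports `shadowRoot === null` and stays invisible even to this check, which would be consistent with elements visible in DevTools but absent from page_source:

```python
# JavaScript probe for shadow roots, iframes, and canvases on the page.
PROBE_JS = """
const hosts = [];
for (const el of document.querySelectorAll('*')) {
  if (el.shadowRoot) hosts.push(el.tagName);   // open shadow roots only
}
return {
  shadowHosts: hosts,
  iframes: document.querySelectorAll('iframe').length,
  canvases: document.querySelectorAll('canvas').length,
};
"""

# Usage with a live driver (requires a browser session):
#   result = driver.execute_script(PROBE_JS)
#   print(result)
print("probe defined, length:", len(PROBE_JS))
```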

const puppeteer = require('puppeteer');
const fs = require('fs');
// Script to fetch the HTML of a web page using puppeteer

(async () => {
  let browser;
  try {
    browser = await puppeteer.launch({
      headless: false,
      args: ['--no-sandbox', '--disable-setuid-sandbox'],
      defaultViewport: null,
      userDataDir: './user_data',
    });

    const page = await browser.newPage();

    await page.goto('https://klef.goe.go.kr/keris_ui/main.do', {
      waitUntil: 'networkidle0', // Wait until all resources are loaded
      timeout: 60000, // Wait for up to 60 seconds
    });

    const html = await page.content();
    fs.writeFileSync('schoolDoc.html', html, 'utf8');
    console.log('The HTML file has been saved as schoolDoc.html.');

  } catch (error) {
    console.error('An error occurred:', error);
  } finally {
    if (browser) await browser.close();
  }
})();

Opening the saved file showed the following:

The desired table wasn't there, only the login module.


Thinking it over, I suspected the table displaying the documents was rendered by a separately installed standalone program.

This thought made it easier to give up.

3. The business portal uses WebSockets.

First, the business portal doesn't send data through a typical REST API but uses WebSockets for data exchange with users.

This fact was quite interesting.


Even copying the cookies into another library like requests to open a new connection failed, because of the WebSocket handshake and session-maintenance requirements.
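The cookie-copying attempt looked roughly like this (a sketch with a hypothetical helper name; the plain GET succeeds, but the document list never arrives over HTTP because it travels on the WebSocket):

```python
import requests

def session_from_driver(driver) -> requests.Session:
    """Copy a logged-in Selenium driver's cookies into a requests.Session."""
    s = requests.Session()
    for c in driver.get_cookies():
        s.cookies.set(c["name"], c["value"], domain=c.get("domain"))
    return s

# Usage with a live, logged-in driver:
#   s = session_from_driver(driver)
#   r = s.get("https://klef.goe.go.kr/keris_ui/main.do")
#   r.status_code comes back 200, but the document list is nowhere in r.text.
```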

But I couldn't give up there.

I decided to inspect requests and responses made through WebSockets using selenium-wire.

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
driver.get("https://klef.goe.go.kr/keris_ui/main.do")
# ... log in and open the document list here ...

# Iterate through every request selenium-wire captured
for request in driver.requests:
    if request.response:
        content_type = request.response.headers.get('Content-Type', '')
        if content_type.startswith('application/json'):
            print("== Request URL:", request.url)
            try:
                body = request.response.body.decode('utf-8', errors='ignore')
                print("== Response Body:", body)
            except Exception as e:
                print("== Decoding Error:", e)

driver.quit()

What came out were alphanumeric strings that looked like Base64.

Trying to decode them as Base64 failed; the encoding apparently doesn't match the spec.
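The failure is easy to reproduce with strict validation: a payload can use the Base64 alphabet and still not satisfy the spec. The sample strings below are made up for illustration, not real portal data:

```python
import base64
import binascii

def try_b64(payload: str):
    """Return decoded bytes, or None if the string is not strictly valid Base64."""
    try:
        return base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        return None

print(try_b64("aGVsbG8="))   # valid Base64 -> b'hello'
print(try_b64("aGVsbG8"))    # Base64 alphabet but bad padding -> None
```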

That's when I decided to stop digging any further.

3. Reflections

I used to be annoyed by how slow public-institution sites are, but after digging into this one, I have to admit its security is solid.

Obtaining even a single piece of data remotely wasn't easy, but I learned a lot.

However, I think I'll try again sometime.

